AMIA Annual Symposium Proceedings. 2026 Feb 14;2025:754–763.

Relational Database-Based Resource-Provenance Visualization Engine: with an application to BICAN data

Xiaojin Li 1, Yan Huang 1, Lydia Ng 2, Kimberly A Smith 2, Wei-Chun Chou 1, Rashmie Abeysinghe 1, Ling Tong 1, Shiwe Lin 1, Licong Cui 1, Shiqiang Tao 1,*, Guo-Qiang Zhang 1,*
PMCID: PMC12919532  PMID: 41726486

Abstract

Provenance tracking ensures data integrity, security, and accountability in healthcare and biomedical research. As biomedical data grows in complexity, comprehensive tracking mechanisms are needed to maintain reproducibility, transparency, and compliance with regulatory standards, such as GDPR. Traditional log-based and ontology-based approaches capture and standardize data lineage, while cryptographic and blockchain-based methods enhance security and verifiability. However, challenges remain in scalability, security, and usability. To address these, we introduce the Resource-Provenance Visualization Engine (RPVE), an advanced system integrating data lineage tracking and interactive visualization. RPVE employs the Randomized N-gram Hashing Identifier (NHash ID) to establish precise data links within the BRAIN Initiative Cell Atlas Network (BICAN) and features an interactive Sankey visualization engine for seamless data exploration. The system enhances provenance tracking by improving data retrieval efficiency, ensuring reliable verification processes, and maintaining data integrity.

1. Introduction

Provenance tracking plays a crucial role in healthcare and biomedical research by ensuring data integrity, security, and accountability [1]. As biomedical research advances, vast amounts of complex and interconnected data are generated, including biological samples, imaging datasets, genomic sequences, and experimental results. To maintain the reliability and reproducibility of these studies, it is essential to establish comprehensive provenance tracking mechanisms that allow researchers to trace the complete lineage of data [2]. This ensures that every transformation and interaction is recorded, fostering trust and transparency in biomedical investigations [3]. Moreover, provenance tracking supports compliance with regulatory standards such as the General Data Protection Regulation (GDPR) [4], ensuring that sensitive data is handled with accountability and ethical responsibility.

Existing approaches to data provenance tracking in healthcare and biomedical research have demonstrated significant progress [2]. Traditional log-based tracking systems rely on capturing metadata from data processing pipelines, enabling the reconstruction of data transformations [5, 6]. Several ontology-based provenance technologies have been proposed to enhance data provenance by ensuring standardized representation, transparency, and interoperability [7–11]. Additionally, cryptographic-based technologies [12–15] have been introduced to strengthen data integrity and security by preventing unauthorized modifications and ensuring authenticity. In recent years, blockchain-based provenance tracking has gained attention as a promising approach, utilizing the permanence of blockchain ledgers to securely record and verify data transactions [16–21]. Public blockchains face challenges with GDPR compliance and scalability, while web- and cloud-based healthcare systems handle these issues efficiently but risk a single point of failure. A hybrid approach combining web, cloud, and blockchain technologies may offer a solution [4].

Several challenges persist in resource provenance tracking. Scalability remains a critical issue, as biomedical datasets often consist of millions of interconnected records that must be efficiently stored, processed, and retrieved [1, 22]. Security is another major concern, as provenance records must remain tamper-proof to comply with regulatory requirements and maintain research credibility [4]. Furthermore, many existing systems fail to provide intuitive visualization interfaces [23], making it difficult for researchers to explore and interact with complex lineage structures. These challenges highlight the need for a more efficient, scalable, and user-friendly provenance tracking system tailored specifically for biomedical research applications.

In this paper, we present the Resource-Provenance Visualization Engine (RPVE), an advanced system that integrates cutting-edge data tracking and visualization techniques. At the core of our system is the Randomized N-gram Hashing Identifier (NHash ID, RRID:SCR_025313) [24], a unique tracking mechanism that establishes precise links between resource items in the BRAIN Initiative Cell Atlas Network (BICAN, RRID:SCR_022794) [25] from their origin to their final data outputs. The NHash ID ensures consistent and accurate tracking of data lineage throughout the research process. Additionally, we develop the Sankey visualization engine, which facilitates interactive visualizations, allowing users to trace data flow seamlessly from ancestors to descendants. These innovations enhance provenance tracking by improving data retrieval efficiency, streamlining verification processes, and preventing the propagation of errors in biomedical research. Moreover, the interactive visualization interface reduces the technical expertise required, making provenance tracking more accessible to researchers. The integration of RPVE with BICAN data has resulted in significant improvements in both the efficiency and accuracy of provenance tracking. By leveraging the NHash ID, our system minimizes data inconsistencies, ensuring reliable verification of research outcomes. The Sankey visualization further enhances the clarity of data flow representation, enabling researchers to navigate complex dependencies with ease. Our findings demonstrate that this approach enhances research efficiency, accelerates data-driven discoveries, and bolsters confidence in the integrity of biomedical data. By providing a comprehensive, scalable, and user-friendly framework for provenance tracking, our system effectively addresses key challenges in biomedical research, ensuring greater transparency, accuracy, and accessibility for researchers working with BICAN data.

2. Background

2.1. BRAIN Initiative Cell Atlas Network (BICAN)

The National Institutes of Health (NIH) Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative (RRID:SCR_006770) [26, 27], launched in 2013, is a landmark effort to advance understanding of the human brain. It focuses on developing cutting-edge technologies to map neural circuits, uncover brain functions, and explore the molecular basis of cognition, behavior, and neurological disorders. By promoting interdisciplinary collaboration, the initiative brings together experts across neuroscience, engineering, and computational fields to create tools for probing the brain with unprecedented precision [25].

The BRAIN Initiative Cell Atlas Network (BICAN) [25] is a critical component of the NIH BRAIN Initiative, with the ambitious goal of creating detailed reference brain cell atlases that will serve as an invaluable resource for the global research community. These atlases are designed to provide a molecular and anatomical framework that supports the study of brain function, structure, and disorders. Building on the foundational work of the BRAIN Initiative Cell Census Network, BICAN takes the next step in advancing our understanding of brain cells and neural circuits by focusing on mapping these cells across multiple species, with a particular emphasis on humans.

The primary objective of BICAN is to generate comprehensive, reference-quality atlases of brain cell types across human, non-human primate, and mouse brains throughout the lifespan. These atlases aim to provide a complete catalog of the diverse range of neurons and non-neuronal cells present in the brain, enabling researchers to explore the complex cellular composition of the brain and understand how these cells interact within neural circuits. In addition to identifying and categorizing the various brain cells, BICAN aims to map how these cells interact and form networks that underlie a wide range of brain functions and disorders. By understanding the cellular interactions involved in cognition, behavior, and disease, BICAN’s work is expected to uncover new insights into the molecular and cellular bases of neurological and psychiatric conditions, such as Alzheimer’s disease, autism, and schizophrenia [25].

2.2. Provenance Tracking Technologies

Several technologies have been proposed to address the challenges of data provenance, each focusing on different aspects of data tracking, security, and integrity [4]. Logging-based technologies primarily utilize log files for the collection, management, and analysis of provenance data. These systems either store logs in centralized nodes or separate provenance data files, and some incorporate techniques to track events in cloud systems, smart contracts, or system-level operations. Notably, Tan [28] emphasizes the reconstruction of log files to model provenance relations, while Kaaniche et al. [29] introduce the use of blockchain-based logs to capture provenance change events in smart contracts, offering additional transparency and security.

Ontology-based technologies utilize semantic web technologies and formal ontologies to represent, manage, and standardize the capture of provenance information. These systems enable the structured representation of provenance data, facilitating its integration and interoperability across diverse domains. For instance, Dalpra et al. [10] propose the use of the PROV-O ontology (RRID:SCR_010415) to capture software process data provenance, thereby enhancing product quality. Additionally, Can et al. [11] use semantic technologies to detect privacy violations in the healthcare domain, underscoring the potential of ontology-based systems to reduce privacy risks. The use of provenance vocabulary enables the formalization and tracking of data provenance in fields like web publishing and genetics [8, 9].

Cryptographic-based technologies focus on safeguarding data provenance against tampering and malicious attacks through encryption, watermarking, and other advanced cryptographic techniques. These approaches are critical in ensuring data authenticity and protecting the integrity of provenance information. For instance, Priyadharshini et al. [12] propose a cryptographic provenance verification method to prevent fake keystroke injections, while Sultana et al. [13] present a lightweight scheme for securely transmitting sensor data provenance, aimed at mitigating the risks posed by adversarial attacks in sensitive environments like the Internet of Things.

Blockchain-based technologies leverage decentralized ledgers to provide secure and transparent data provenance. Blockchain technology ensures that all data operations are traceable and verifiable, thus enhancing data integrity without reliance on centralized authorities. For example, Javaid et al. [30] discuss the use of blockchain with smart contracts to achieve data provenance and integrity, while Sigwart et al. [31] address scalability and privacy concerns in IoT environments. Blockchain’s decentralized nature makes it particularly suitable for applications in healthcare, where traceability and data security are of paramount importance, as demonstrated by existing works [19–21], which apply blockchain to track drug traceability and personal health data, respectively.

3. Methods

3.1. System Design and Architecture

We designed the system architecture, as illustrated in Figure 1, with four core components: 1) a relational database with NHash identifiers (Figure 1.A), which stores various resource tables and their relationships, enabling efficient data retrieval and management; 2) a Sankey visualization engine built with Ruby on Rails (Figure 1.B), which translates user-defined parameters into renderable JSON data through four functional modules: a search and filtering module for extracting relevant data, a data reorganization module for structuring it, a hierarchical decomposition module for layered organization, and an adaptive scaling module for dynamic adjustments based on dataset size; 3) an interactive user interface (Figure 1.C), which allows users to customize Sankey diagram parameters for flexible visualization; and 4) a well-structured Application Programming Interface (API), which facilitates seamless interaction between users and the Sankey diagram generation process. Together, these components enable efficient data processing, interactive visualization, and a user-friendly experience for Sankey diagram customization.

Figure 1: The system architecture of RPVE.

3.2. Data Model with Randomized N-gram Hashing Identifier (NHash ID)

NHash [24] is a novel solution designed to overcome the limitations of traditional identifier systems, such as Universally Unique Identifier (UUID), in large-scale collaborative research projects. Unlike traditional systems, NHash employs a blockchain-style provenance tracking mechanism that ensures data integrity and clear lineage management. Using a randomized N-gram hashing process, NHash generates unique and verifiable identifiers that are resistant to collisions, reducing the risk of data mismatches. With an appropriate prefix design, NHash identifiers also carry human-readable information, making it easier to interpret resource types, such as tissue samples or donor data. This simplifies the management of diverse research resources and improves data traceability across complex projects. Traditional identifier systems face challenges in large-scale research environments, particularly in linking and verifying data across multiple stages and locations. These systems often lack the robustness needed for effective data linkage and fail to provide verifiable confirmation of the resource they represent. NHash addresses these issues by ensuring that identifiers are not only unique and secure but also easily traceable and identifiable, facilitating resource tracking and validation. It focuses on data provenance, collision resistance, and human-readable identifiers, providing a scalable and reliable solution for managing and tracking a wide range of research resources, improving accuracy and transparency throughout the research lifecycle.

The NHash ID generation process consists of two key phases. First, several N-grams are generated using a resource’s metadata and its up-stream resource’s metadata. These N-grams are secure cryptographic hashes resistant to reverse-engineering. In the second phase, the N-grams are concatenated with a random seed number and further transformed using cryptographic techniques like a shift cipher, which encrypts the data, adding an extra layer of security and collision prevention. The resulting identifier is unique and validates the connection between the resource, the up-stream resource, and the random seed.

We incorporate a prefix into the NHash ID to indicate the associated resource table, as shown by examples like RA-XXX, RB-XXX, and RC-XXX in Figure 1.A. This approach improves data organization and enables efficient identification of entities across tables. The six-step process for generating a donor’s NHash ID in BICAN is illustrated in Figure 2. Starting with a donor’s local subject ID and project number, the input is first sanitized by replacing punctuation and numbers with alphabetic characters (Step 1). Then, in Step 2, a random number R between 0 and 10,000 is generated. In Step 3, a prefix and two bigrams are derived from the resource type, subject ID, and project number, with N-gram positions determined via modular arithmetic. Step 4 concatenates the prefix and bigrams, followed by encryption in Step 5 using methods like a shift cipher. Finally, Step 6 combines the resource type abbreviation, encrypted string, and random number to form the NHash ID. A duplication check ensures uniqueness; if a collision is detected, the process is repeated from Step 2 to generate a new NHash ID.

Figure 2: The NHash ID Generation Algorithm: a donor example in BICAN
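The six steps described above can be illustrated with a minimal Python sketch. It is not the published NHash implementation: the digit-to-letter mapping in `sanitize`, the way `bigram_at` picks positions, the shift amount, the zero-padded four-digit random number, and all function names are assumptions made only to show the overall flow.

```python
import random
import string

def sanitize(text: str) -> str:
    """Step 1: keep letters; map digits and punctuation to letters (mapping is assumed)."""
    out = []
    for ch in text.upper():
        if ch in string.ascii_uppercase:
            out.append(ch)
        elif ch in string.digits:
            out.append(string.ascii_uppercase[int(ch)])  # 0 -> A, ..., 9 -> J
        else:
            out.append("X")  # any other character
    return "".join(out)

def bigram_at(text: str, r: int) -> str:
    """Step 3: choose a bigram whose start index is r modulo the number of bigrams."""
    padded = (text + "XX")[:2] if len(text) < 2 else text
    start = r % (len(padded) - 1)
    return padded[start:start + 2]

def shift_cipher(text: str, shift: int) -> str:
    """Step 5: Caesar-style shift over the letters A-Z."""
    return "".join(chr((ord(c) - 65 + shift) % 26 + 65) for c in text)

def generate_nhash_id(prefix, subject_id, project_no, existing, rng=None, max_attempts=1000):
    """Return a new identifier that is not yet in `existing` (a set of issued IDs)."""
    rng = rng or random.Random()
    subject, project = sanitize(subject_id), sanitize(project_no)
    for _ in range(max_attempts):
        r = rng.randint(0, 9999)                                        # Step 2
        plain = prefix + bigram_at(subject, r) + bigram_at(project, r)  # Steps 3-4
        encrypted = shift_cipher(plain, r % 26)                         # Step 5
        candidate = f"{prefix}-{encrypted}{r:04d}"                      # Step 6
        if candidate not in existing:                                   # duplication check
            existing.add(candidate)
            return candidate
    raise RuntimeError("no unused identifier found")
```

A call such as `generate_nhash_id("DO", "H19.30.001", "U01-MH1", issued_ids)` uses a made-up subject ID and project number; on a collision, the loop simply draws a new random number, mirroring the return to Step 2.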

A relational database with NHash IDs serves as the foundation for data storage and management, ensuring structured organization and efficient retrieval of interconnected resources. This database is designed to store multiple resource tables, each representing different types of entities, such as donors, brain specimens, tissues, and cell samples, along with their associated attributes, including timestamps, metadata, and unique identifiers. These tables are interconnected through well-defined relationships, such as one-to-one, one-to-many, or many-to-many associations, enabling seamless data linking and retrieval. The database schema ensures data consistency, normalization, and optimized indexing, allowing for efficient querying and retrieval of large datasets. Additionally, the use of NHash IDs improves data security and indexing efficiency by generating unique, randomized hash values through n-gram tokenization, which helps minimize redundancy and optimize query performance. This structured relational model enables seamless integration with the visualization engine, ensuring that data can be rapidly accessed, filtered, and transformed into the required visualization format. By leveraging this approach, the system supports scalable, high-performance data operations while maintaining the integrity and consistency of stored information.

3.3. Sankey Visualization Engine

Sankey diagrams are powerful visualization tools that represent flow dynamics between multiple entities. They are particularly useful in provenance visualization, where understanding the movement, transformation, and lineage of data is essential. Provenance, in this context, refers to the historical record of data, processes, or objects, detailing their origins, transformations, and dependencies. One of the primary advantages of using Sankey diagrams for provenance visualization is their ability to depict the proportional distribution of data as it moves through different stages of a process. The flow lines in a Sankey diagram reflect the magnitude of data or resources being transferred, providing an intuitive representation of changes and dependencies. Sankey diagrams are particularly relevant in domains such as scientific computing, data processing pipelines, cybersecurity, and digital forensics. In scientific workflows, researchers use them to track the lineage of datasets, ensuring the reproducibility of experiments.

Despite their advantages, Sankey diagrams also present challenges in provenance visualization. As the complexity of data flows increases, the readability of the diagram can suffer, leading to overlapping paths and visual clutter. To address these challenges, we employed several strategies, including hierarchical decomposition, adaptive scaling, and interactive visualization, as shown in Figure 1.B.

Hierarchical decomposition breaks down large and complex data flow structures into smaller, more manageable subcomponents, helping to manage complexity. By organizing provenance data across different levels of depth, users can focus on specific layers without being overwhelmed by excessive information from other levels. This method enables a step-by-step exploration process, allowing users to delve into deeper details only when necessary. As a result, hierarchical decomposition enhances the structured organization of provenance data and improves users’ ability to comprehend relationships and transformations at varying levels of granularity.
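As a rough illustration of organizing provenance data by depth, the following Python sketch groups the descendants of a resource item into levels and stops at a chosen maximum depth. The adjacency-map input, the function name, and the example identifiers are hypothetical; the paper does not specify how the decomposition module is implemented.

```python
from collections import deque

def decompose_by_depth(children, root, max_depth):
    """Return {depth: [node ids]} for all nodes within max_depth steps of root.

    `children` maps a node id to the list of its direct descendants.
    """
    levels = {0: [root]}
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for child in children.get(node, []):
            if child not in seen:  # a node reachable by several paths is listed once
                seen.add(child)
                levels.setdefault(depth + 1, []).append(child)
                queue.append((child, depth + 1))
    return levels

# Made-up lineage: one donor, two slabs, one ROI, one tissue.
example = {
    "DO-AAAA0001": ["SL-AAAA0001", "SL-AAAA0002"],
    "SL-AAAA0001": ["RI-AAAA0001"],
    "RI-AAAA0001": ["TI-AAAA0001"],
}
layers = decompose_by_depth(example, "DO-AAAA0001", max_depth=2)
```

With `max_depth=2`, the tissue level is left out, which corresponds to viewing only the upper layers until the user asks for more detail.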

Adaptive scaling ensures that Sankey diagrams remain clear and readable even as the complexity of the source data grows. This approach dynamically adjusts the layout and flow representations to optimize space usage while reducing visual clutter. Techniques such as node reduction and adaptive positioning help streamline the diagram’s structure, preserving the visibility of critical relationships and trends. By modifying the display based on data density, adaptive scaling enhances interpretability, ensuring that key insights remain accessible regardless of the diagram’s complexity.
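One possible form of node reduction is sketched below in Python: for each source node, only the heaviest outgoing links are kept and the remainder is merged into a single summary node. This is an assumed strategy for illustration only; the threshold, the summary label, and the function name are hypothetical, and the paper does not describe which reduction rule RPVE applies.

```python
from collections import defaultdict

def reduce_links(links, max_targets):
    """Keep the max_targets heaviest links per source; merge the rest into one node.

    `links` is a list of (source, target, weight) tuples.
    """
    by_source = defaultdict(list)
    for source, target, weight in links:
        by_source[source].append((target, weight))

    reduced = []
    for source, targets in by_source.items():
        targets.sort(key=lambda tw: tw[1], reverse=True)  # heaviest first
        kept, rest = targets[:max_targets], targets[max_targets:]
        reduced.extend((source, target, weight) for target, weight in kept)
        if rest:
            label = f"{source}: {len(rest)} more"  # summary node for collapsed targets
            reduced.append((source, label, sum(weight for _, weight in rest)))
    return reduced

# Made-up links from one slab to four ROIs.
dense = [
    ("SL-AAAA0001", "RI-AAAA0001", 3),
    ("SL-AAAA0001", "RI-AAAA0002", 1),
    ("SL-AAAA0001", "RI-AAAA0003", 1),
    ("SL-AAAA0001", "RI-AAAA0004", 2),
]
sparse = reduce_links(dense, max_targets=2)
```

The sketch handles a single layer of links; a full implementation would also need to decide what happens to the descendants of collapsed nodes.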

Interactive visualization further improves usability by enabling users to dynamically manipulate Sankey diagrams. We develop a Sankey diagram visualization using the Highcharts JavaScript library (RRID:SCR_016095) [32], incorporating features such as zooming, path highlighting, and node metadata display. These enhancements improve interactivity, allowing users to explore data flow in a flexible and intuitive manner. This shift transforms the Sankey diagram from a static representation into a dynamic, exploratory tool, facilitating a deeper understanding of data transformations and dependencies.
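To make the hand-off to the charting library concrete, the Python sketch below serializes a list of provenance links into a Highcharts-style options object with a `sankey` series whose rows follow the `from`, `to`, `weight` key order. The function name, the example identifiers, and the choice of a unit weight per link are assumptions; the JSON that RPVE actually emits may be structured differently.

```python
import json

def to_highcharts_sankey(links, title):
    """Build a Highcharts-style options dict for a Sankey series.

    `links` is a list of (source, target, weight) tuples.
    """
    return {
        "title": {"text": title},
        "series": [{
            "type": "sankey",
            "keys": ["from", "to", "weight"],  # column order of each data row
            "data": [list(link) for link in links],
        }],
    }

# Made-up lineage with one unit of flow per parent-child relationship.
options = to_highcharts_sankey(
    [("DO-AAAA0001", "SL-AAAA0001", 1), ("SL-AAAA0001", "RI-AAAA0001", 1)],
    title="Downstream provenance",
)
payload = json.dumps(options)  # string that a front end could pass to the chart
```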

3.4. Interactive User Interface

An interactive user interface (UI), as illustrated in Figure 1.C, enables users to customize Sankey diagram parameters such as resource type, resource item, and depth for more flexible and focused data visualization. Filtering by resource type lets users narrow the diagram to specific categories based on their role or relationship in the dataset. Selecting a resource item helps target particular records of interest, such as input data sources in a provenance diagram, making it easier to navigate large datasets. Integrated search functions further enhance usability by allowing quick identification of items or relationships based on attributes or labels. Adjustable depth settings allow users to control the level of detail, either displaying high-level overviews or expanding specific branches for finer granularity. This progressive exploration approach helps manage visual complexity by revealing detailed layers only when needed.

These interactive features are further enhanced through real-time updates, which allow users to modify visualization elements dynamically and see the changes reflected immediately in the diagram. To develop such an effective interactive UI, it is essential to incorporate fundamental features such as drag-and-drop controls, real-time updates, search and filtering capabilities, and integration with live data sources. In this way, users can interact with resource items or processes to highlight specific paths, isolate selected sections, and simultaneously display all connected processes, thereby facilitating a clear and intuitive understanding of the relationships within the data.

3.5. Application Programming Interface (API)

To support stable and flexible data visualization across the BICAN community, we developed a modular API architecture with three core components: 1) Data Querying API, 2) Hierarchical Relationship API, and 3) Sankey Diagram Generation API. These components enable efficient data exchange, user input processing, and dynamic rendering, ensuring seamless interaction with the visualization system. The Data Querying API uses NHash IDs to retrieve resource items efficiently, with minimal overhead. It supports JSON-based queries and integrates with external applications, enhancing provenance tracking and lineage analysis. The Hierarchical Relationship API provides endpoints to traverse parent-child, sibling, and descendant relationships, with configurable depth and filters to balance detail and performance. This supports accurate Sankey diagram construction and complex dependency analysis. The Sankey Diagram Generation API converts user-defined parameters into visualizations through automated aggregation and rendering. It also supports saving and retrieving configurations for future use. Together, these APIs streamline data exploration, enabling researchers to derive insights and make evidence-based decisions through interactive visualizations.

API development faces two key challenges: over-fetching and under-fetching. Over-fetching occurs when APIs return excessive data, leading to inefficiencies. For example, in BICAN, brain bank personnel may need donor details, while sequencing center personnel require brain structure data. Traditional APIs often return all available data, causing redundancy [33]. Under-fetching happens when APIs fail to retrieve all necessary data, requiring multiple queries [34]. For instance, in BICAN, a traditional RESTful API may require multiple queries to retrieve a tissue sample along with its related resources, leading to increased latency and reduced performance [35]. To address over- and under-fetching, we implemented a micro-application architecture (Figure 3) that separates the user interface into multiple independent front-end applications. The Provenance Tracking Interface (Section 4.2) is one example, with others supporting data entry and retrieval. A central container application manages authentication and routes users based on roles encoded in JWT tokens. Each micro-application is tailored to a specific user group and connects to dedicated backend APIs aligned with its workflow. This design ensures only the necessary data is retrieved for each task, reducing redundancy and improving performance. By distributing responsibilities across modular components, the architecture enhances scalability and delivers a more efficient, role-specific user experience.

Figure 3: The micro-application architecture.
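The idea of returning only role-relevant data can be sketched in a few lines of Python. The role names, field names, and the `scope_record` helper below are hypothetical, and the sketch assumes the token has already been verified and decoded into a claims dictionary; it is not a description of the RPVE backend.

```python
# Hypothetical mapping from a role claim to the fields that role may receive.
ROLE_FIELDS = {
    "brain_bank": {"nhash_id", "age_at_death", "sex"},
    "sequencing_center": {"nhash_id", "brain_structure"},
}

def scope_record(record, claims):
    """Return only the fields of `record` allowed for the role in `claims`.

    `claims` stands for the already-verified payload of a JWT; an unknown or
    missing role yields an empty result instead of the full record.
    """
    allowed = ROLE_FIELDS.get(claims.get("role"), set())
    return {key: value for key, value in record.items() if key in allowed}

# Made-up record combining donor details and brain structure data.
record = {
    "nhash_id": "DO-AAAA0001",
    "age_at_death": 71,
    "sex": "F",
    "brain_structure": "primary motor cortex",
}
for_bank = scope_record(record, {"role": "brain_bank"})
for_sequencing = scope_record(record, {"role": "sequencing_center"})
```

Filtering on the server side in this way addresses over-fetching; under-fetching is addressed separately by endpoints that return a resource together with its related items in one response.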

4. Results

4.1. BICAN Data Repository

The Data Repository we have built for BICAN serves as a centralized and structured storage system, designed to efficiently organize and manage diverse resource types. As of today, it comprises 11 distinct resource types (depicted as purple blocks in Figure 4), representing an experimental workflow towards generating single cell genomics data. Donors are entered into RPVE with an extensive set of clinical pathology metadata by Brain Banks. Postmortem brain specimens are divided into thick (typically coronal) Brain Slabs. Brain Slab images are uploaded through a UI with tools to rotate and crop to individual slabs. Tissue can be requested from a Brain Bank by drawing a Region of Interest (ROI) on the slab images. Multiple Tissues can be combined and processed by a Library Lab to create a Dissociated Cell Sample. Optionally, the cell sample can go through an enrichment step to select for specific cell types, yielding an Enriched Cell Sample. The cell barcoding step adds a molecular barcode to each individual cell of the input cell sample. Portions of the Barcoded Cell Sample are then used to create one or more Libraries for sequencing. For gene expression analysis, amplification of the input material before Library generation is needed to have accurate and reliable sequencing (Amplified cDNA). Finally, a portion of a Library (Library Aliquot) is combined with other Library Aliquots to form a Library Pool. Each library aliquot in a library pool will have a unique index to allow them to be sequenced together and then demultiplexed, generating a set of FASTQ files ready for alignment for each library aliquot.

Figure 4: The 11 distinct resource types in BICAN.

The repository is structured as a relational database, accurately capturing the hierarchical and interconnected relationships among resource types through an entity-relation model, as also depicted in Figure 4. For instance, a single donor is linked to multiple slabs, each slab contains multiple ROIs, and each ROI encompasses multiple tissues, forming a well-defined parent-child hierarchy. A relational database was chosen because the dataset’s structure is predictable, tabular, and well-suited to clearly specify entities and relationships. This model ensures strong data integrity through primary and foreign key constraints, supports complex queries via standardized SQL, and provides reliable transactional consistency, which is critical for managing clinical and research data. While NoSQL or graph databases may be beneficial for unstructured or highly dynamic data, the relational approach enables efficient data retrieval, integrity enforcement, and scalability, facilitating seamless navigation across interconnected datasets.
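The donor, slab, and ROI hierarchy and its key constraints can be sketched as follows. The example uses SQLite from the Python standard library in place of the MySQL backend so that it runs without a server, and the table names, column names, and identifiers are hypothetical rather than taken from the BICAN schema.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a three-level parent-child hierarchy."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
    conn.executescript("""
        CREATE TABLE donors (nhash_id TEXT PRIMARY KEY);
        CREATE TABLE slabs (
            nhash_id TEXT PRIMARY KEY,
            donor_id TEXT NOT NULL REFERENCES donors(nhash_id)
        );
        CREATE TABLE rois (
            nhash_id TEXT PRIMARY KEY,
            slab_id TEXT NOT NULL REFERENCES slabs(nhash_id)
        );
    """)
    conn.execute("INSERT INTO donors VALUES ('DO-AAAA0001')")
    conn.executemany("INSERT INTO slabs VALUES (?, ?)", [
        ("SL-AAAA0001", "DO-AAAA0001"),
        ("SL-AAAA0002", "DO-AAAA0001"),
    ])
    conn.executemany("INSERT INTO rois VALUES (?, ?)", [
        ("RI-AAAA0001", "SL-AAAA0001"),
        ("RI-AAAA0002", "SL-AAAA0001"),
        ("RI-AAAA0003", "SL-AAAA0002"),
    ])
    return conn

def rois_for_donor(conn, donor_id):
    """Follow donor -> slab -> ROI with a standard SQL join."""
    rows = conn.execute(
        """
        SELECT rois.nhash_id
        FROM rois
        JOIN slabs ON rois.slab_id = slabs.nhash_id
        WHERE slabs.donor_id = ?
        ORDER BY rois.nhash_id
        """,
        (donor_id,),
    )
    return [row[0] for row in rows]
```

With the foreign keys in place, an attempt to insert a slab that points to a nonexistent donor is rejected by the database, which is the kind of integrity enforcement the relational model provides.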

In addition, we created a unique NHash ID for each resource item, which is critical for maintaining data consistency and enabling precise tracking and querying of resources. Currently, the Data Repository integrates brain specimen and sequencing data derived from 16 separate NIH-funded projects (4 U24, 4 UM1, 3 U01, 1 U19, 4 R01) and NIH NeuroBioBank (RRID:SCR_003131), ensuring a rich and diverse dataset for analysis and research. By leveraging relational database principles, it provides a comprehensive and scalable framework for storing, querying, and analyzing complex biomedical data, supporting advanced visualization and computational processing. Table 1 presents the number of records for each resource type in the repository. Overall, the repository currently contains nearly 100,000 records, with the total continuing to grow through ongoing daily data entry.

Table 1: NHash ID within BICAN data.

Resource Type            Prefix  Total Count
Donor                    DO-     1,230
Slab                     SL-     8,593
ROI                      RI-     21,903
Tissue                   TI-     12,196
Dissociated Cell Sample  DC-     5,426
Enriched Cell Sample     EC-     4,859
Barcoded Cell Sample     BC-     10,168
Amplified cDNA           AC-     7,835
Library                  LI-     13,191
Library Aliquot          LA-     13,200
Library Pool             LP-     778
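Because the prefix identifies the resource table, the resource type of an item can be read directly from its NHash ID. The Python sketch below encodes the prefixes and counts from Table 1; the dictionary layout and the `resource_type_of` helper are illustrative and not part of RPVE.

```python
# Prefix -> (resource type, record count), transcribed from Table 1.
RESOURCE_TYPES = {
    "DO": ("Donor", 1230),
    "SL": ("Slab", 8593),
    "RI": ("ROI", 21903),
    "TI": ("Tissue", 12196),
    "DC": ("Dissociated Cell Sample", 5426),
    "EC": ("Enriched Cell Sample", 4859),
    "BC": ("Barcoded Cell Sample", 10168),
    "AC": ("Amplified cDNA", 7835),
    "LI": ("Library", 13191),
    "LA": ("Library Aliquot", 13200),
    "LP": ("Library Pool", 778),
}

def resource_type_of(nhash_id):
    """Look up the resource type encoded in the prefix of an NHash ID."""
    prefix, separator, body = nhash_id.partition("-")
    if not separator or not body or prefix not in RESOURCE_TYPES:
        raise ValueError(f"not a recognized NHash ID: {nhash_id!r}")
    return RESOURCE_TYPES[prefix][0]

total_records = sum(count for _, count in RESOURCE_TYPES.values())
```

Summing the counts gives 99,379 records, consistent with the figure of nearly 100,000 stated above.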

4.2. Provenance Tracking Interface

The Provenance Tracking Interface is designed to provide users with a comprehensive and intuitive way to explore resource relationships and metadata within the BICAN Data Repository. This interface is structured into two main components: 1) a user interface (UI) for interactive exploration, which enables users to navigate complex datasets efficiently through parameter configuration and interactive visualization; and 2) an API for seamless data retrieval and integration, which provides a streamlined and efficient method for programmatically accessing resource items and their relationships. RPVE was developed using the Ruby on Rails framework (RRID:SCR_022129), with MySQL (RRID:SCR_025972) serving as the backend database. Data visualizations were implemented using the Highcharts JavaScript library [32].

4.2.1. User Interface

The UI of RPVE was designed to provide an intuitive platform for exploring complex resource relationships. As illustrated in Figure 5, the UI consists of two primary components: a Sankey configuration area (Figure 5.A) and a Sankey visualization area (Figure 5.B). The configuration area enables users to refine their data retrieval process through six key parameters: 1) Visualization Type, allowing users to define the direction of data retrieval, either downstream to display descendant resource items (“Downstream” button) or upstream to reveal parent resource items (“Upstream” button); 2) Resource Type, providing a selection of 11 predefined resource types to streamline the identification of relevant records; 3) Search by NHash ID, allowing users to locate a specific resource item using its unique NHash ID, which inherently encodes the resource type; 4) Search by Local ID, offering an alternative search method based on locally assigned identifiers generated by different projects; 5) Maximum Tracking Depth, enabling users to limit the hierarchical depth of the Sankey visualization, thereby managing the complexity of displayed relationships; and 6) Label Display, which enables users to switch node labels between NHash ID and local ID without regenerating the diagram.

Figure 5: The user interface of RPVE.

The visualization area dynamically constructs a Sankey diagram based on user-configured parameters, rendering the provenance relationships of the selected resource item. As shown in Figure 5.B, the system provides an upstream tracking visualization for the library pool with the NHash ID LP-ROBCPL938294. In this diagram, nodes represent individual resource items, while edges indicate the hierarchical relationships between them. This visualization allows users to trace the lineage of a resource back to its origin, offering a comprehensive view of the data flow and dependencies within the repository. Specifically, Figure 5.B illustrates that tissue samples from seven brain slabs of donor DO-IKOQ4823 contributed to the sequencing analysis, with each processing step clearly recorded and represented. Each path in the diagram corresponds to the actual sequencing workflow, capturing distinct stages such as tissue dissection, dissociation, enrichment, barcoding, amplification, library preparation, and pooling. This hierarchical structure ensures that all transformations, from donor registration to sequencing preparation, are transparently documented and visually interpretable.

To enhance user interaction and analytical capabilities, the visualization area incorporates several interactive features. Hovering over a node reveals node information and highlights all associated paths, allowing users to quickly identify provenance relationships and track resource flows. Clicking on a node displays detailed metadata, which can be examined directly within the interface or exported in JSON format for further analysis. Additionally, the control center (Figure 5.B.1) provides zooming for detailed exploration and the ability to download the entire Sankey diagram in JPG format, supporting straightforward documentation and data sharing. By integrating real-time interactivity, structured provenance representation, and flexible data export options, the visualization area gives researchers a single tool to analyze, interpret, and validate BICAN datasets.

4.2.2. Application Programming Interface

The API component provides programmatic access to the RPVE system through three specialized endpoints. All API endpoints are secured with JSON Web Tokens (JWTs), which also serve as user role and micro-application identifiers. The Data Querying API accepts HTTP GET requests with an NHash ID parameter (“id”) and returns JSON objects containing the resource information accessible to the caller. For example, a request to /info?id=DO-IKOQ4823 locates a record in the Donor table and returns the data permitted by the token. The Hierarchical Relationship API supports relationship exploration through endpoints such as /descendants?id=DO-IKOQ4823&max_depth=5. This endpoint traces all resources descended from the donor according to the database schema and uses the “max_depth” parameter to limit the depth of the search. It returns structured JSON representing resource hierarchies, with parent-child relationships and metadata at each level. Finally, the Sankey Diagram Generation API accepts an NHash ID (“id”) and “max_depth” through GET requests such as /sankey_visualization_downstream?id=DO-IKOQ4823&max_depth=5, producing a complete Sankey-compatible JSON structure that defines nodes, links, values, and visualization properties. This structured output can be directly consumed by front-end visualization libraries or exported for external analysis.
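A client for these endpoints might prepare its requests as in the following sketch. The endpoint paths and the “id” and “max_depth” parameters are taken from the text; the host name `rpve.example.org`, the helper name `build_request`, and the use of an `Authorization: Bearer` header to carry the JWT are assumptions, since the section does not specify how the token is transmitted.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical host; the real RPVE base URL is not given in the paper.
BASE_URL = "https://rpve.example.org"


def build_request(path, nhash_id, token, max_depth=None):
    """Prepare a GET request for an RPVE-style endpoint without sending it."""
    params = {"id": nhash_id}
    if max_depth is not None:
        params["max_depth"] = max_depth
    url = f"{BASE_URL}{path}?{urlencode(params)}"
    # Assumption: the JWT travels as a Bearer token in the Authorization header.
    headers = {"Authorization": f"Bearer {token}"}
    return Request(url, headers=headers, method="GET")


# The three request shapes described in the text.
info_req = build_request("/info", "DO-IKOQ4823", "dummy-token")
tree_req = build_request("/descendants", "DO-IKOQ4823", "dummy-token", max_depth=5)
sankey_req = build_request(
    "/sankey_visualization_downstream", "DO-IKOQ4823", "dummy-token", max_depth=5
)
```

Passing one of these objects to `urllib.request.urlopen` would send the request. The sketch stops short of that because the host is a placeholder.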

Each API component implements authentication through API keys, rate limiting to prevent abuse, and comprehensive error handling with standardized response codes and descriptive messages. The APIs have been widely adopted by the BICAN community, enabling researchers to access, explore, and visualize complex resource relationships within the Data Repository. The APIs have been integrated into various external systems and applications, such as the Neuroscience Multi-omic Archive (NeMO, RRID:SCR_016152) [36] and the Brain Image Library (BIL, RRID:SCR_017272) [37], providing seamless data exchange and visualization capabilities across different research projects. By offering a flexible and scalable API architecture, the RPVE system supports diverse user needs, enhances data accessibility, and fosters collaboration within the BICAN research community.

5. Discussion

The RPVE system, as presented in this paper, provides an effective solution for metadata storage, provenance tracking, and data visualization, addressing key challenges in data management and data integrity verification. The system’s design incorporates important features that contribute to its functionality and scalability in managing complex datasets.

RPVE leverages the capabilities of relational databases to store metadata and manage relationships between different resource types. This approach ensures an organized and structured framework for tracking metadata, which is essential for datasets involving intricate relationships. A significant feature of RPVE is its use of an in-house NHash ID for each resource item, instead of the more commonly used UUID. The NHash ID serves a dual purpose: it uniquely identifies a resource item and, because it encodes the resource type, it facilitates provenance tracking. This design choice improves the system’s ability to track the historical context of data, allowing for more accurate and detailed provenance tracking.
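To illustrate how a type-bearing identifier differs from an opaque UUID, the sketch below reads the resource type from an ID. It assumes that the characters before the first hyphen are the type code, which is inferred only from the two examples in this paper (DO-IKOQ4823 for a donor and LP-ROBCPL938294 for a library pool). The function names and the lookup table are hypothetical, and the codes for the other nine resource types are not given here.

```python
# Type codes inferred from the two example IDs in the paper; incomplete by design.
RESOURCE_TYPES = {
    "DO": "Donor",
    "LP": "Library Pool",
}


def split_nhash_id(nhash_id):
    """Split an ID of the assumed form '<type code>-<body>' into its two parts."""
    prefix, sep, body = nhash_id.partition("-")
    if not sep or not prefix or not body:
        raise ValueError(f"not in '<type code>-<body>' form: {nhash_id!r}")
    return prefix, body


def resource_type(nhash_id):
    """Return the resource type name for an ID, or 'unknown' for unlisted codes."""
    prefix, _ = split_nhash_id(nhash_id)
    return RESOURCE_TYPES.get(prefix, "unknown")
```

Under these assumptions, a lookup can pick the right table from the ID alone, which a UUID would not allow without an extra index or a search across tables.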

RPVE also includes a UI that integrates an interactive Sankey diagram that updates in real time in response to user-configured parameters. Traditional Sankey diagrams, which are static, may not adequately represent dynamic data flows. In contrast, the real-time updating feature of RPVE makes the diagram more flexible and informative. The interactive components of the UI also allow users to explore and analyze data more intuitively, improving the overall user experience. Beyond the UI, RPVE offers several APIs through which users can interact with the system from their own applications, either by integrating RPVE with existing tools or by developing custom applications tailored to their specific requirements. This makes the system adaptable to a wide range of use cases and user needs.

RPVE has demonstrated substantial practical value across a range of research applications, particularly in the context of the BICAN project. It provides robust support for tracking data provenance, enabling users to trace complex resource lineages with precision. For example, as shown in Figure 5, Library Pool LP-ROBCPL938294 can be traced back to the originating donor DO-IKOQ4823, including all intermediate processing steps. This capability enhances transparency and facilitates reproducibility in multi-step experimental workflows. In addition, RPVE enables systematic verification of data integrity by allowing users to visually assess the completeness and consistency of resource relationships. This function is routinely employed in BICAN for post-ingestion validation, ensuring that submitted datasets are accurate, well-structured, and fully documented. RPVE is particularly effective for managing datasets that undergo continuous updates, providing tools to maintain traceability and data quality throughout the lifecycle of the repository. Furthermore, the system supports standardized export of metadata and provenance structures in formats such as JSON and JPG, thereby streamlining the generation of project documentation, academic reports, and supplementary materials for publications. These export functionalities facilitate alignment between data and written content and promote interoperability with external analytic tools. Collectively, RPVE’s integrated querying, visualization, and export capabilities enhance the reliability, transparency, and reusability of complex biomedical datasets in collaborative research environments.

While the current paper focuses on RPVE’s application to the BICAN dataset, the system is designed to be scalable and can be applied to other relational databases for provenance tracking. The modular design of RPVE ensures its adaptability to different datasets, extending its potential across various domains. This scalability indicates that RPVE could be effectively applied in fields such as bioinformatics, social sciences, and any area that requires provenance tracking and data verification.

Limitations and Future Work. While RPVE offers solutions for metadata storage, provenance tracking, and data visualization, it has several limitations. RPVE currently focuses on the BICAN dataset, and adapting it to other datasets may require customization for compatibility with different database structures and metadata types. RPVE may also face performance issues with large datasets or highly interrelated resources due to slower query speeds in relational databases. Future work will improve scalability, performance, and the system’s ability to handle complex data. Expanding testing across diverse datasets will enhance adaptability and help identify optimization opportunities. Additionally, enhancing the API for better customization and refining the user interface, particularly the Sankey diagram, will improve usability for more complex data visualization. We will also provide an online demo and a comprehensive user manual to support adoption and ease of use.

6. Conclusion

In this paper, we propose the RPVE as a framework for metadata management, provenance tracking, and data visualization in biomedical research. By leveraging the NHash ID for precise data lineage tracking and an interactive Sankey visualization engine for real-time data exploration, RPVE ensures accurate and reliable tracking of data transformations. The system’s structured metadata management through relational databases provides consistent provenance tracking, while its API integration enhances adaptability across various research environments. Applied to the BICAN dataset, RPVE has demonstrated improvements in data retrieval efficiency, integrity verification, and research reproducibility. Additionally, its export functionalities facilitate seamless documentation and reporting. RPVE’s modular and scalable design allows for its adaptation to diverse domains requiring provenance tracking and data integrity verification, reinforcing its significance in advancing data-driven research.

Acknowledgement.

This work was supported in part by the National Institutes of Health (NIH) grant U24MH130988. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

References

  • [1].Gierend K, Krüger F, Genehr S, Hartmann F, Siegel F, Waltemath D, Ganslandt T, Zeleke A. A. Capturing provenance information for biomedical data and workflows: A scoping review. Research Square. 2023. PREPRINT (Version 1)
  • [2].Sembay M. J, Macedo D. D. J de, Júnior L. P, Braga R. M. M, Sarasa-Cabezuelo A. Provenance data management in health information systems: a systematic literature review. Journal of Personalized Medicine. 2023;13(6):991. doi: 10.3390/jpm13060991.
  • [3].Johns M, Meurers T, Wirth F. N, Haber A. C, Müller A, Halilovic M, Balzer F, Prasser F. Data provenance in biomedical research: scoping review. Journal of medical Internet research. 2023;25:e42289. doi: 10.2196/42289.
  • [4].Ahmed M, Dar A. R, Helfert M, Khan A, Kim J. Data provenance in healthcare: approaches, challenges, and future directions. Sensors. 2023;23(14):6495. doi: 10.3390/s23146495.
  • [5].Jiang W, Hu C, Pasupathy S, Kanevsky A, Li Z, Zhou Y. Understanding customer problem troubleshooting from storage system logs. Proceedings of the 7th conference on File and storage technologies. 2009:43–56.
  • [6].Suen C. H, Ko R. K, Tan Y. S, Jagadpramana P, Lee B. S. 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications. IEEE; 2013. S2logger: End-to-end data tracking mechanism for cloud data provenance; pp. 594–602.
  • [7].Ram S, Liu J. Proceedings of the First International Workshop on the Role of Semantic Web in Provenance Management. Vol. 526. Washington, DC, USA: 2009. A new perspective on semantics of data provenance.
  • [8].Sahoo S. S, Sheth A. P. Microsoft eScience Workshop. Pittsburgh, PA, USA: 2009. Provenir ontology: Towards a framework for escience provenance management.
  • [9].Hartig O, Zhao J. Provenance and Annotation of Data and Processes: Third International Provenance and Annotation Workshop, IPAW 2010, Troy, NY, USA, June 15-16, 2010. Revised Selected Papers 3. Springer; 2010. Publishing and consuming provenance metadata on the web of linked data; pp. 78–90.
  • [10].Dalpra H. L, Costa G. C. B, Sirqueira T. F. M, Braga R, Campos F, Werner C. M. L, David J. M. N. Proceedings of the Brazilian Seminar on Ontologies. São Paulo, Brazil: 2015. Using ontology and data provenance to improve software processes.
  • [11].Can O, Yilmazer D. Improving privacy in health care with an ontology-based provenance management system. Expert Systems. 2020;37(1):e12427.
  • [12].Priyadharshini M. D, Ananth C. A secure hash message authentication code to avoid certificate revocation list checking in vehicular adhoc networks. International Journal of Applied Engineering Research (IJAER) 2015;10:1250–1254.
  • [13].Sultana S, Ghinita G, Bertino E, Shehab M. 2012 IEEE 18th International Conference on Parallel and Distributed Systems. IEEE; 2012. A lightweight secure provenance scheme for wireless sensor networks; pp. 101–108.
  • [14].Wang C, Bertino E. Sensor network provenance compression using dynamic bayesian networks. ACM Transactions on Sensor Networks (TOSN) 2017;13(1):1–32.
  • [15].Yang Y, Liu X, Guo W, Zheng X, Dong C, Liu Z. Multimedia access control with secure provenance in fog-cloud computing networks. Multimedia Tools and Applications. 2020;79(15):10701–10716.
  • [16].Liang X, Shetty S, Tosh D, Kamhoua C, Kwiat K, Njilla L. 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) IEEE; 2017. Provchain: A blockchain-based data provenance architecture in cloud environment with enhanced privacy and availability; pp. 468–477.
  • [17].Neisse R, Steri G, Nai-Fovino I. A blockchain-based approach for data accountability and provenance tracking. Proceedings of the 12th international conference on availability, reliability and security. 2017:1–10.
  • [18].Shetty S, Red V, Kamhoua C, Kwiat K, Njilla L. Disruptive Technologies in Sensors and Sensor Systems. Vol. 10206. SPIE; 2017. Data provenance assurance in the cloud using blockchain; pp. 125–135.
  • [19].Musamih A, Salah K, Jayaraman R, Arshad J, Debe M, Al-Hammadi Y, Ellahham S. A blockchain-based approach for drug traceability in healthcare supply chain. IEEE access. 2021;9:9728–9743.
  • [20].Kakade S. V, Tiple B, Sansuddi A. S, Kokate M. D, Alate M. M, Chavan S. Blockchain-based medical record sharing in healthcare iot: Building trust and transparency through secure provenance tracking. Journal of Electrical Systems. 2023;19(3)
  • [21].Sun L, Liu D, Li Y, Zhou D. A blockchain-based e-healthcare system with provenance awareness. IEEE Access. 2024;12:110098–110112.
  • [22].Wang J, Crawl D, Purawat S, Nguyen M, Altintas I. 2015 IEEE international conference on big data (Big Data) IEEE; 2015. Big data provenance: Challenges, state of the art and opportunities; pp. 2509–2516.
  • [23].Borkin M. A, Yeh C. S, Boyd M, Macko P, Gajos K. Z, Seltzer M, Pfister H. Evaluation of filesystem provenance visualization tools. IEEE transactions on visualization and computer graphics. 2013;19(12):2476–2485. doi: 10.1109/TVCG.2013.155.
  • [24].Zhang G.-Q, Tao S, Xing G, Mozes J, Zonjy B, Lhatoo S. D, Cui L, et al. Nhash: randomized n-gram hashing for distributed generation of validatable unique study identifiers in multicenter research. JMIR medical informatics. 2015;3(4):e4959.
  • [25].The brain initiative cell atlas network. Available online. https://braininitiative.nih.gov/research/tools-and-technologies-brain-cells-and-circuits/brain-initiative-cell-atlas-network (accessed on 15 March 2025)
  • [26].Bargmann C. I, Newsome W. T. The brain research through advancing innovative neurotechnologies (brain) initiative and neurology. JAMA neurology. 2014;71(6):675–676. doi: 10.1001/jamaneurol.2014.411.
  • [27].The national institutes of health brain research through advancing innovative neurotechnologies (brain) initiative. Available online. https://braininitiative.nih.gov/ (accessed on 15 March 2025)
  • [28].Tan Y. S. Reconstructing Data Provenance from Log Files. The University of Waikato; 2017.
  • [29].Kaaniche N, Belguith S, Laurent M, Gehani A, Russello G. 17th International Conference on Security and Cryptography (SECRYPT) SCITEPRESS-Science and Technology Publications; 2020. Prov-trust: towards a trustworthy sgx-based data provenance system; pp. 225–237.
  • [30].Javaid U, Aman M. N, Sikdar B. Blockpro: Blockchain based data provenance and integrity for secure iot environments. Proceedings of the 1st workshop on blockchain-enabled networked sensor systems. 2018:13–18.
  • [31].Sigwart M, Borkowski M, Peise M, Schulte S, Tai S. A secure and extensible blockchain-based data provenance framework for the internet of things. Personal and Ubiquitous Computing. 2020:1–15.
  • [32].Highcharts - interactive charting library for developers. Available online. https://www.highcharts.com/ (accessed on 15 March 2025)
  • [33].Richardson L, Amundsen M, Ruby S. RESTful web APIs: services for a changing world. O’Reilly Media, Inc.; 2013.
  • [34].Bizer C, Heath T, Berners-Lee T. Linked data-the story so far. Linking the World’s Information: Essays on Tim Berners-Lee’s Invention of the World Wide Web. 2023:115–143.
  • [35].Masse M. REST API design rulebook: designing consistent RESTful web service interfaces. O’Reilly Media, Inc.; 2011.
  • [36].Ament S. A, Adkins R. S, Carter R, Chrysostomou E, Colantuoni C, Crabtree J, Creasy H. H, Degatano K, Felix V, Gandt P, et al. The neuroscience multi-omic archive: a brain initiative resource for single-cell transcriptomic and epigenomic data from the mammalian brain. Nucleic acids research. 2023;51(D1):D1075–D1085. doi: 10.1093/nar/gkac962.
  • [37].Kenney M, Vasylieva I, Hood G, Cao-Berg I, Tuite L, Laghaei R, Smith M. C, Watson A. M, Ropelewski A. J. The brain image library: A community-contributed microscopy resource for neuroscientists. Scientific Data. 2024;11(1):1212. doi: 10.1038/s41597-024-03761-8.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
