This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2025 Aug 26:2025.08.21.25334173. [Version 1] doi: 10.1101/2025.08.21.25334173

Scaling Sensor Metadata Extraction for Exposure Health Using LLMs

Fatemeh Shah-Mohammadi, Sunho Im, Julio C Facelli, Mollie R Cummins, Ram Gouripeddi
PMCID: PMC12407612  PMID: 40909811

Abstract

Background

The rapid evolution and diversity of sensor technologies, coupled with inconsistent reporting of sensor metadata across formats and sources, pose significant challenges for generating exposomes and conducting exposure health research.

Objective

Despite the development of standardized metadata schemas, extracting sensor metadata from unstructured sources remains largely manual and unscalable. To address this bottleneck, we developed and evaluated a large language model (LLM)-based pipeline that automates sensor metadata extraction and harmonization from publicly available exposure health literature.
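To make the extraction and harmonization steps concrete, the following is a minimal sketch of how an instruction-only (zero-shot) prompt might be built and how a model reply might be mapped onto a fixed schema. The field names, the unit synonym table, and both function names are hypothetical illustrations, not the schema or code used in this study, and the call to the language model itself is left out.

```python
import json

# Hypothetical target schema: illustrative field names only,
# not the metadata schema used in the study.
SCHEMA_FIELDS = ("sensor_name", "manufacturer", "measured_quantity", "unit")

# Hypothetical synonym table mapping reported spellings to one canonical unit.
UNIT_SYNONYMS = {
    "ug/m^3": "ug/m3",
    "micrograms per cubic meter": "ug/m3",
}


def build_zero_shot_prompt(article_text: str) -> str:
    """Build an instruction-only prompt (no worked examples, hence zero-shot)."""
    fields = ", ".join(SCHEMA_FIELDS)
    return (
        "Extract sensor metadata from the article below. "
        f"Reply with one JSON object with the keys: {fields}. "
        "Use null for anything the article does not state.\n\n"
        + article_text
    )


def harmonize(raw_reply: str) -> dict:
    """Map a JSON reply onto the fixed schema and normalize simple variations."""
    record = json.loads(raw_reply)
    harmonized = {}
    for field in SCHEMA_FIELDS:
        value = record.get(field)  # missing keys become None
        if isinstance(value, str):
            value = value.strip() or None  # empty strings become None
        harmonized[field] = value
    unit = harmonized["unit"]
    if isinstance(unit, str):
        # Fall back to the reported spelling when no synonym is known.
        harmonized["unit"] = UNIT_SYNONYMS.get(unit.lower(), unit)
    return harmonized
```

Keys outside the schema are dropped and missing keys are filled with None, so every record has the same shape regardless of how a source article reported its sensors.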

Methods

Using GPT-4 in a zero-shot setting, we constructed a pipeline that parses full-text PDFs, extracts sensor metadata, and harmonizes the output into structured formats.

Results

Our automated pipeline completed extractions much faster than manual review and demonstrated strong performance, with an average accuracy and precision of 94.74%, recall of 100%, and F1-score of 97.28%.
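For readers less familiar with these measures, the sketch below shows the standard definitions of accuracy, precision, recall, and F1-score computed from counts of correct and incorrect extractions. The counts in the usage example are hypothetical and chosen only for illustration; they are not data from this study, and the abstract does not state how its averages were formed.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int = 0) -> dict:
    """Compute standard metrics from true/false positive and negative counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 18 correct extractions, 1 spurious one, none missed.
metrics = classification_metrics(tp=18, fp=1, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these illustrative counts, accuracy and precision are both 18/19 (94.74%) and recall is 100%. Accuracy and precision coincide whenever there are no false negatives and no true negatives are counted. An F1-score averaged over individual documents need not equal the F1-score of pooled counts, which here is 36/37 (97.30%).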

Conclusions

This study demonstrates the feasibility and scalability of leveraging LLMs to automate sensor metadata extraction for exposure health, reducing manual burden while enhancing metadata completeness and consistency. Our findings support the integration of LLM-driven pipelines into exposure health informatics platforms.

Full Text Availability

The license terms selected by the author(s) for this preprint version do not permit archiving in PMC. The full text is available from the preprint server.


Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
