JMIR Publications (@jmirpub.bsky.social)

Harmonizing Logical Observation Identifiers Names and Codes (LOINC) Codes and Units in Real-World Oncology Data: Method Development and Evaluation

Background: The expanding use of multisource real-world electronic health record (EHR) and claims data offers major opportunities for #research, drug discovery, and clinical decision support. While standards such as Logical Observation Identifiers Names and Codes (LOINC) can ensure semantic interoperability for laboratory observations, clinical documents, and other clinical terms, properly assigning these concepts remains a challenge. Studies show that 6% to 19% of laboratory tests cannot be accurately mapped to LOINC. Existing systems try to address this challenge but often depend on source data strings and other input features that may be absent, null, or incorrect. This underscores the need for a scalable approach to correct LOINC code assignments, standardize units, and ensure data integrity across multisource laboratory data.

Objective: This paper presents a universally applicable framework that identifies and corrects observable errors in quantitative laboratory results coded to LOINC and the Systematized Nomenclature of Medicine for the unit of measure without relying on raw source data strings. The process seeks to improve the accuracy, conformance, consistency, and completeness of laboratory data while maintaining complete provenance.

Methods: The proposed framework uses a 2-step process. First, LOINC codes are corrected using the associated unit of measure. Second, units are adjusted or populated to match a preferred unit for that LOINC code. In both steps, the quantitative result is checked against a predefined acceptable range to determine validity. The process is driven by 3 knowledge tables. The framework is applied to datasets derived from the ConcertAI database of approximately 10 million #patients with cancer, evaluating improvements in LOINC code–unit conformance and unit completeness. Analyses are performed on 4 independently LOINC-coded datasets: the full ConcertAI dataset and 3 high-volume diverse subsets grouped by data source or EHR vendor.

Results: A total of 428 LOINC codes were observed across 6.34 billion records in the ConcertAI database. All 4 datasets were processed using the proposed framework. Before applying the framework, 73.1% (4,634,610,173/6,337,101,453) of records in the ConcertAI dataset had correctly assigned units based on the laboratory reasonable range table; after application, this increased to 99.7% (6,322,375,200/6,341,230,213). Similar improvements were observed across the 3 EHR-specific datasets, increasing from 78.5% (691,315,390/880,250,137) to 99.8% (879,626,472/881,157,852; source 1), 71.4% (2,132,455,936/2,985,465,124) to 99.8% (2,982,319,644/2,988,173,959; source 2), and 63.3% (2,936,710,502/4,640,432,294) to 99.6% (4,618,714,114/4,638,862,412; source 3). Unit completeness also improved substantially, increasing from 92.7% (5,879,071,858/6,341,230,213) to 99.8% (6,331,923,060/6,341,230,213) in the ConcertAI dataset and from 92.5% (814,867,241/881,157,852) to 100% (880,816,133/881,157,852), 94.4% (2,822,107,252/2,988,173,959) to 99.9% (2,986,624,027/2,988,173,959), and 91.7% (4,254,054,966/4,638,862,412) to 99.8% (4,632,935,919/4,638,862,412) in sources 1 to 3, respectively.

Conclusions: Laboratory data quality is crucial in oncology systems for therapy selection, monitoring, and disease progression assessment. This proposed solution is a first-of-its-kind, system-agnostic, and scalable normalization process that addresses key gaps in laboratory data quality across multiple dimensions.

Trial Registration: http://dlvr.it/TRNrVK
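
The 2-step process described under Methods can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not describe the 3 knowledge tables in detail, so the table names, their layout, the example ranges, and the conversion factor below are all hypothetical. The LOINC codes used (2345-7 and 14749-6 for glucose by mass and molar concentration, 718-7 for hemoglobin) are real codes chosen only as examples.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# Hypothetical knowledge tables; entries are illustrative, not from the paper.
# 1) (LOINC, unit) -> acceptable (low, high) range for the quantitative result.
REASONABLE_RANGE = {
    ("2345-7", "mg/dL"): (10.0, 2000.0),   # glucose, mass concentration
    ("14749-6", "mmol/L"): (0.5, 110.0),   # glucose, molar concentration
    ("718-7", "g/dL"): (1.0, 25.0),        # hemoglobin
}
# 2) (observed LOINC, observed unit) -> LOINC code implied by that unit.
LOINC_BY_UNIT = {("2345-7", "mmol/L"): "14749-6"}
# 3) LOINC -> preferred unit, plus factors converting other units into it.
PREFERRED_UNIT = {"2345-7": "mg/dL", "14749-6": "mmol/L", "718-7": "g/dL"}
TO_PREFERRED = {("718-7", "g/L"): 0.1}


@dataclass(frozen=True)
class LabResult:
    loinc: str
    value: float
    unit: Optional[str]
    provenance: Tuple[str, ...] = ()  # records every change that was applied


def in_range(loinc: str, unit: str, value: float) -> bool:
    """True if the value lies inside the acceptable range for (loinc, unit)."""
    bounds = REASONABLE_RANGE.get((loinc, unit))
    return bounds is not None and bounds[0] <= value <= bounds[1]


def correct_loinc(r: LabResult) -> LabResult:
    """Step 1: correct the LOINC code using the associated unit of measure."""
    target = LOINC_BY_UNIT.get((r.loinc, r.unit))
    if target is not None and in_range(target, r.unit, r.value):
        note = f"loinc {r.loinc} -> {target}"
        return replace(r, loinc=target, provenance=r.provenance + (note,))
    return r


def normalize_unit(r: LabResult) -> LabResult:
    """Step 2: adjust or populate the unit to match the preferred unit."""
    preferred = PREFERRED_UNIT.get(r.loinc)
    if preferred is None or r.unit == preferred:
        return r
    if r.unit is None:
        # Populate a missing unit only when the value is plausible for it.
        if in_range(r.loinc, preferred, r.value):
            note = f"unit populated as {preferred}"
            return replace(r, unit=preferred, provenance=r.provenance + (note,))
        return r
    factor = TO_PREFERRED.get((r.loinc, r.unit))
    if factor is not None:
        converted = r.value * factor
        if in_range(r.loinc, preferred, converted):
            note = f"unit {r.unit} -> {preferred}"
            return replace(r, value=converted, unit=preferred,
                           provenance=r.provenance + (note,))
    return r  # left untouched when no validated correction exists


def harmonize(r: LabResult) -> LabResult:
    """Apply both steps in order; the input record is never mutated."""
    return normalize_unit(correct_loinc(r))
```

For example, `harmonize(LabResult("2345-7", 5.5, "mmol/L"))` would recode the record to 14749-6 in this sketch, while a hemoglobin result with no unit and an implausible value is returned unchanged. Keeping records immutable and appending to a `provenance` tuple is one simple way to reflect the abstract's emphasis on maintaining complete provenance.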