Towards Data Interoperability: Practical Issues in Terminology Implementation and Mapping

Lee Min Lau MD PhD, Shaun Shakib MPH, 3M Health Information Systems

Previously presented at the 2005 Health Informatics Conference, Melbourne, Australia, July 31 to August 2, 2005

ABSTRACT

Introduction: Many health care organizations face the challenge of data interoperability. Standard vocabularies are a means of encoding data for exchange, comparison or aggregation among systems. However, there are issues concerning their use in the Electronic Health Record (EHR).

Approach: The 3M Healthcare Data Dictionary (HDD) integrates and supplements standard vocabularies, presenting a master reference terminology used to encode data in the Clinical Data Repository (CDR). The HDD content is cross-referenced to standard vocabularies. Each HDD concept is identified by a Numerical Concept IDentifier (NCID), and the identifiers from external terminologies are mapped to it. NCIDs are used to encode the data in the CDR. The code in the incoming message to the CDR is translated by the HDD to the corresponding NCID for storage. If the sending system uses non-standard codes, then these legacy codes are first mapped into the HDD. Through mapping, the HDD translates between standard vocabularies, between legacy systems, and between a legacy system and a standard vocabulary. For external data exchange, the HDD translates the encoded data from NCIDs to a standard code.

Discussion: An alternative approach is to use the identifiers of standard vocabularies, instead of NCIDs, to encode data in the CDR. While standardization and interoperability are achieved, challenges remain.

Shift in meaning of a standard code: If a standard code is used to encode data and the meaning of that code later changes (code reuse), the data will be interpreted incorrectly. HDD concepts never change their meanings; thus the reused code will be mapped to a different concept, and data encoded with NCIDs will not be misinterpreted.

Removal of standard codes: If data is encoded with a standard code that has since been removed by the vocabulary, the data is no longer interpretable. An NCID is never deleted; an inactive standard code is marked as such without affecting the CDR data.

Lack of comprehensive standard codes: A standard vocabulary may not provide all the codes that correspond to the entire set of data in current use. The HDD provides all concepts needed to encode data in the CDR. If a code is later assigned by the standard vocabulary, it is mapped to its corresponding NCID in the HDD, with no updates required for the CDR.

Local extensions: Local extensions are codes added at each facility. The HDD coordinates all local extensions within an organization so the data is standardized enterprise-wide.

Historical patient data: If historical data has been encoded with non-standard vocabularies, then encoding data with standard vocabularies from this point onwards would leave the historical data non-interoperable. The HDD maps legacy codes to NCIDs to ensure interoperability for historical data.

Conclusion: The lack of data interoperability is a key obstacle to realizing the full benefits of an EHR. Standard vocabularies provide the means for interoperability, but present problems surrounding their use. The HDD approach evolved from meeting user needs. We share our experience in the interest of promoting data standardization.

Introduction

Many health care organizations have invested significant effort in the Electronic Health Record (EHR). Departmental information systems capture patient data to provide a multitude of administrative and clinical functions. Organizations face the challenge of exchanging, comparing, aggregating or integrating data among their multiple systems or facilities, and with external organizations. Unfortunately, the data collected is often in a non-standard, non-structured or even non-coded (text) form, resulting in a lack of interoperability. Interoperability means that the data encoded at one site is interpretable at another site, as if the data were encoded there. Interoperability allows data to be used in applications regardless of origin, and to be aggregated and compared across location and time. One means to achieve interoperability is data standardization.

Data standardization refers to the use of the same set of codes to encode data throughout a system. As an example, for the domain of "sex", one may decide always to code the sex of male as "1", female as "2", and unknown as "3". The domain of "sex", consisting of three members, "male", "female" and "unknown", forms a vocabulary, albeit a very simple one. If all data about sex is coded consistently according to this vocabulary, the data should always be understandable and usable for analysis in a longitudinal fashion and across populations. Thus, standard vocabularies have been advocated for use in encoding patient data.

A standard vocabulary or coding scheme is one that has wide industry acceptance or use. Standards are obtained from a variety of efforts. The US federal government has purchased a national license for the Systematized NOmenclature of MEDicine Clinical Terminology (SNOMED CT). The National Library of Medicine (NLM) maintains the Unified Medical Language System (UMLS), with the latest focus being RxNorm, a reference terminology for clinical drugs. Standards are developed by consensus industry effort, such as version 3 of Health Level 7 (HL7). The set of standards to be used by US government agencies will be named by the Consolidated Health Informatics (CHI) initiative; the first vocabulary standard selected is Logical Observation Identifiers Names and Codes (LOINC). Examples of vocabularies that are considered standards for billing are the International Classification of Diseases (ICD) from the World Health Organization (WHO) and the various country-specific versions. The National Drug Code (NDC) may also be considered a standard for use within the US pharmacy industry.

Approach

Experience has shown that standard vocabularies present many challenges in their implementation and use in the EHR. A standard vocabulary or coding scheme is usually designed to be the reference for a specific domain, e.g., laboratory tests; or to fulfill a particular goal, e.g., mortality and morbidity classification. A single standard vocabulary is unlikely to meet all the EHR needs of an organization. Our solution is to build a Vocabulary Server application [1] that integrates and supplements the relevant standard vocabularies for the appropriate domains, presenting a master reference terminology. The application, known as the Healthcare Data Dictionary (HDD), is designed to support the integration of coded data in the Clinical Data Repository (CDR). The content of the HDD is cross-referenced to standard vocabularies, e.g., SNOMED CT; reference sources, e.g., UMLS; and classification schemes, e.g., the International Classification of Diseases, 9th Edition, Clinical Modification (ICD9CM). External vocabularies are not loaded as disparate islands of code sets, but are mapped to unique concepts in the HDD, and linked by the appropriate relationships. Each HDD concept is identified by a meaningless Numerical Concept IDentifier (NCID), and the identifiers from external terminologies are mapped to it. For instance, the concept of chickenpox would have the UMLS Concept Unique Identifier (CUI) of C0008049 and the SNOMED CT ConceptID of 186513009 as mapped representations. It would also have an "ICD9CM Code" relationship to "052.9".

NCIDs are used to encode the data in the CDR. Data enters the CDR through incoming transactions such as Health Level 7 (HL7) messages, or via data entry programs such as a clinical workstation. The code in the incoming message is translated by the HDD to the corresponding NCID for storage in the CDR. Standard codes such as SNOMED CT ConceptIDs, UMLS CUIs or LOINCs are already in the HDD and can be sent in incoming transactions. If the sending system uses non-standard codes, then these legacy codes are mapped into the HDD so that when they are received in incoming messages, they can be translated to NCIDs. Through mapping, the HDD translates between one standard vocabulary and another, between legacy systems, and between a legacy system and a standard vocabulary. For an organization to exchange data with external systems, the HDD translates the encoded data from NCIDs to the requested external standard code (e.g. LOINC).
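The translation step described above can be sketched as a lookup table keyed by (vocabulary, code) pairs. The chickenpox identifiers are the ones given in the text; the NCID value, the legacy vocabulary name, and the function names are hypothetical.

```python
# Sketch of HDD-style translation: many external codes map to one NCID,
# and the NCID translates back out to any requested standard vocabulary.
# NCID 1000001 and the "LEGACY_LAB_A" source are illustrative only.
CODE_TO_NCID = {
    ("SNOMED_CT", "186513009"): 1000001,   # chickenpox (from the text)
    ("UMLS_CUI", "C0008049"): 1000001,
    ("ICD9CM", "052.9"): 1000001,
    ("LEGACY_LAB_A", "VZV01"): 1000001,    # a mapped legacy code (assumed)
}

# reverse index for outbound translation: (ncid, vocabulary) -> code
NCID_TO_CODE = {(ncid, vocab): code
                for (vocab, code), ncid in CODE_TO_NCID.items()}

def to_ncid(vocab: str, code: str) -> int:
    """Translate an incoming message code to the NCID stored in the CDR."""
    return CODE_TO_NCID[(vocab, code)]

def to_standard(ncid: int, vocab: str) -> str:
    """Translate a stored NCID to the requested external standard code."""
    return NCID_TO_CODE[(ncid, vocab)]

# a legacy code and a SNOMED CT code store as the same NCID ...
assert to_ncid("LEGACY_LAB_A", "VZV01") == to_ncid("SNOMED_CT", "186513009")
# ... and the stored NCID can be exchanged externally as an ICD9CM code
assert to_standard(1000001, "ICD9CM") == "052.9"
```

Because the CDR stores only NCIDs, adding a new external vocabulary is purely a mapping-table change.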

The 3M HDD has been used to encode data into the CDR for multiple commercial and government facilities and systems in the US. As an example, a total of 3 million legacy metadata items from the five domain areas of Demographics/Encounters, Laboratory, Microbiology, Pharmacy and Radiology/Text Reports have been mapped for all US DoD facilities. Before the HDD mapping, the DoD facilities functioned as standalone systems because the legacy codes were all different from one host site to another. The patient data was not interoperable. Now, NCIDs are used to encode patient data in the DoD CDR, and two years of historical patient data were pulled from each legacy system. Previously isolated islands of legacy data are now interoperable across all DoD locations.

Discussion

An alternative approach would be to use the identifiers of standard vocabularies to encode patient data in the CDR, instead of NCIDs. While standardization and interoperability would be achieved with this approach, other challenges remain. The HDD evolved to manage the various issues encountered in using standard vocabularies in the EHR. In the interest of promoting the use of standard vocabularies for data interoperability, we present our solution and experience as lessons learned.

Shift in meaning of a standard code: If the code from an external standard vocabulary is used to encode data and the meaning of the code changes over time, the data will be interpreted incorrectly. For instance, NDC 00074433501 was given to the drug product "Liposyn (Fat Emulsions), 10%, Intravenous Solution, Intravenous, Abbott Hospital, 200ml Bag" until July 2002, when it was reassigned to "Paclitaxel (Paclitaxel, Semi-Synthetic), 6mg/ml, Vial, Injection, Abbott Hospital, 5ml Vial". If the NDC is stored as the patient's medication data, then after July 2002, the patient will mistakenly be thought to have been given Paclitaxel. This misinterpretation of the data has an obvious negative effect on population-based data use. If the patient is still under current treatment, the data error can potentially affect clinical care.

Code reuse is a common problem for many standard coding schemes, e.g., ICD9CM. Software packages that provide code sets usually supply just the latest version, with no backward compatibility. Often, a vocabulary makes what it considers an "adjustment", e.g., when LOINC changed the specimen attribute of its laboratory results from separate Serum and Plasma to a combined Serum/Plasma. However, the resultant implications for the data are often non-trivial.

The HDD needs to insulate the stored patient data from perturbations in the external vocabularies. Concepts created in the HDD never change their meanings so that patient data encoded with NCIDs will never be misinterpreted. Since external standard codes are mapped to NCIDs, as external codes are reused, the mapping in the HDD will change accordingly. In the above example, the first drug product has NCID 3000493238 (Liposyn). The second drug product has NCID 3000536480 (Paclitaxel). The NDC of 00074433501 was mapped to NCID 3000493238 (Liposyn) in the Active NDC Context until July 2002, when it was moved out of the Active NDC Context into the Inactive NDC Context. At the same time, the NDC of 00074433501 was added (mapped) to NCID 3000536480 (Paclitaxel) in the Active NDC Context. The patient data is stored in the CDR as NCIDs, not NDCs; so the information is always correctly stored and interpreted as Liposyn or Paclitaxel regardless of the NDC reuse. In addition, through the HDD one can trace the "path" of the NDC and find out what a drug product's NDC is at a point in time. This is particularly important for accurate data interoperability and exchange. Otherwise, two communicating institutions might falsely assume the data to be the same because the code is the same - as in the above example - when they are not, because of code reuse.
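The NDC-reuse handling above can be sketched as a mapping with effective date ranges, so the same NDC resolves to different NCIDs at different points in time. The NCIDs and the July 2002 cutover are taken from the example in the text; the effective start date of the Liposyn mapping is assumed for illustration.

```python
from datetime import date

# Sketch of time-aware NDC-to-NCID mapping. NCIDs are from the text;
# the 1990 start date is an assumption made for this illustration.
NCID_LIPOSYN = 3000493238
NCID_PACLITAXEL = 3000536480
CUTOVER = date(2002, 7, 1)   # NDC 00074433501 reused in July 2002

# (ndc, ncid, effective_from, effective_to); None means still active
NDC_MAPPINGS = [
    ("00074433501", NCID_LIPOSYN, date(1990, 1, 1), CUTOVER),
    ("00074433501", NCID_PACLITAXEL, CUTOVER, None),
]

def ndc_to_ncid(ndc: str, as_of: date) -> int:
    """Resolve an NDC to the NCID it identified on a given date."""
    for code, ncid, start, end in NDC_MAPPINGS:
        if code == ndc and start <= as_of and (end is None or as_of < end):
            return ncid
    raise LookupError(f"no mapping for NDC {ndc} on {as_of}")

# the same NDC resolves differently before and after the July 2002 reuse,
# while the stored NCIDs themselves never change meaning
assert ndc_to_ncid("00074433501", date(2001, 5, 1)) == NCID_LIPOSYN
assert ndc_to_ncid("00074433501", date(2003, 5, 1)) == NCID_PACLITAXEL
```

Moving a code between the Active and Inactive contexts corresponds here to closing one date range and opening another; no NCID, and therefore no stored patient record, is touched.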

Removal of standard codes: Frequently, standard vocabularies may retire or delete codes. If patient data is stored using the removed code, it will no longer be interpretable. For instance, the American Medical Association's Current Procedural Terminology (CPT) code 0002T, "endovascular repair of infrarenal abdominal aortic aneurysm or dissection; aorto-uni-iliac or aorto-unifemoral prosthesis", was effective January 1, 2002, and was terminated from use after December 31, 2003. Versions of CPT after this date will not contain code 0002T. If previous versions of CPT are not maintained in a master file or data dictionary, patient data stored as code 0002T prior to January 1, 2004 will not be interpretable.

In the HDD, an NCID, once created, is never deleted. In the above example, the CPT code of 0002T is mapped to NCID 14780136, in the CPT Code Context. When the code of 0002T is no longer needed, while remaining mapped to NCID 14780136, it will be "moved" to the Inactive CPT Code Context. Because it is the NCID that is used to encode data and not the CPT code, the patient data will not be lost to use.

Lack of comprehensive standard codes: A standard vocabulary may not provide all the codes that correspond to the entire set of data in current use. This is particularly true of standard vocabularies that are built via voluntary submission from participating organizations over time, e.g., LOINC. In certain use cases, mandating the use of only the codes from a standard coding scheme may be acceptable, e.g., ICD9CM for reimbursement. In other situations, particularly where clinical care or workflow is concerned, it is important to capture the data accurately according to what really occurred.

The LOINC database for laboratory results was started with the master files from seven US laboratories. It was first released in April 1995 with approximately 6,500 codes, and has since grown through submission from laboratories, hospitals, and other organizations, notably 3M. The latest release in December 2004 contains approximately 28,000 laboratory codes and roughly 14,000 clinical observation codes. LOINC is released periodically. In 2003 it was released in May and October; in 2002, January, February, August and September; in 2001, January and July; in 2000, February and June. If LOINC codes are to be used to code data directly to be stored in the CDR, is the clinician restricted to ordering only those laboratory tests that currently have associated LOINC codes? This could have a significant impact upon clinical practice and workflow, or worse, lead to either imprecise information or data gaps in the CDR.

This is why the LOINC committee recommends that LOINC codes should be recorded "as attributes of existing test/observation master files" for use in the appropriate message segments to communicate among systems. Many laboratory observations will never receive a LOINC code. Some are used for internal system processes, e.g., "DoD DNA Samples". Others serve as text placeholders for send-out laboratory results, e.g., "Cystic Fibrosis DNA Test". Yet others have attributes that are not compliant with LOINC definitions or rules, e.g., "ABO Group, Serum or Plasma, Qualitative". Creating these laboratory results as HDD concepts and using the NCIDs to encode data in the CDR would prevent interruption to workflow. Those laboratory observations that are suitable for inclusion in the LOINC database would be submitted to the LOINC Committee. With each LOINC release the HDD is updated by assigning (mapping) the new LOINC codes to their corresponding, existing NCIDs. Since it is the same NCID that is stored whether there is a LOINC code or not, there is no need to do any update or transformation to the data in the CDR.
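The release-update step described above can be sketched as mapping newly released LOINC codes onto existing NCIDs; the CDR, which stores only NCIDs, needs no update. The NCID value and local concept name below are hypothetical (2951-2 is used as an illustrative LOINC code for serum sodium).

```python
# Sketch of applying a LOINC release: new codes are assigned to
# pre-existing NCIDs. NCID 2000001 and its name are illustrative.
LOINC_TO_NCID = {}                                  # grows with each release
NCID_CATALOG = {2000001: "Sodium, Serum/Plasma"}    # concepts already in use

def apply_loinc_release(new_assignments: dict) -> None:
    """Map newly released LOINC codes to their corresponding existing NCIDs."""
    for loinc_code, ncid in new_assignments.items():
        if ncid not in NCID_CATALOG:
            raise KeyError(f"unknown NCID {ncid} for LOINC {loinc_code}")
        LOINC_TO_NCID[loinc_code] = ncid

# before the release, data is already stored under NCID 2000001;
# after the release, the same NCID simply gains a LOINC representation
apply_loinc_release({"2951-2": 2000001})
assert LOINC_TO_NCID["2951-2"] == 2000001
```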

Local extensions: Local extensions are codes added at a health care facility to encode data when the appropriate, equivalent code is not yet found in the standard vocabulary. Many of these may never be added to standard vocabularies. Local extensions provide concepts needed by the facility at a different granularity or compositional structure. For example, a standard for race may provide "Asian/Pacific Islander", whereas a facility may wish to differentiate between the two. Local extensions are common for medications compounded at the facility. There may also be local codes created to support applications or processes. Local extensions are critical for efficient workflow and data capture at the facility. However, they are also one of the biggest reasons why many health care organizations have data interoperability problems among their systems.

The HDD is used to coordinate local extensions within an organization. It enables the organization to support clinical functions by capturing complete, accurate and appropriately detailed data, encoded with enterprise-wide codes. The data is thus standardized within the organization, even though the code is not found in the external standard vocabulary. System-wide processes can be implemented for efficiency and cost savings. Decision support and population queries can be applied to all enterprise data. To restrict the organization to the use of only standard codes will not serve the needs of the enterprise.

Historical Patient Data: Many health care organizations have used information systems for a significant duration and have amassed a considerable amount of patient data. This historical patient data is important for continuity of care, quality of care delivery, population health management and outcomes research. If the legacy data has been encoded with non-standard or proprietary vocabularies, then encoding the data using standard vocabularies from this point onwards would result in the system behaving as if data collection were only starting now. Paper records or text printouts would be needed for past medical data, e.g., for a follow-up visit. Essentially, the historical data is lost to computable clinical or administrative use.

Because the HDD maps legacy codes to NCIDs, historical data that has been encoded with the legacy codes is then "translatable" and interoperable with current data. The HDD mapping is also useful for organizations that need to continue using legacy systems and codes, or that store patient data in multiple databases in addition to the CDR.

Conclusion

The lack of data interoperability is a key obstacle to realizing the full benefits of the EHR. Standard vocabularies provide the means for organizations to exchange, compare and aggregate data externally. The problems surrounding the use of standard vocabularies are not new. The HDD approach to terminology development, implementation, mapping and maintenance evolved from meeting the EHR needs of our users. We share our experience and lessons learned in the interest of promoting data standardization.

References

1. Rocha R, et al. Designing a Controlled Medical Vocabulary Server: The VOSER Project. Computers and Biomedical Research, 27(6): 472-507, 1994.


Source: Clinical Vocabulary Mapping Methods Institute, 77th AHIMA Convention and Exhibit, October 2005