Database and Multi-database Design

Geoffrey Fox gfc@npac.syr.edu
Bill Braithwaite Bill%TROPIQ@VAXF.Colorado.edu
Alton Brantley bab2@mhg.edu
Marina Chen mcchen@cs.bu.edu
Jerry Cox jrc@hobbs.wustl.edu
Dave O'Halloran Dave.Ohallaron@n2.sp.cmu.edu
Judy Ozbolt ozbolt@virginia.edu
Robert Sanders bobs@aplcomm.jhuapl.edu
Paul Silverstein pauls@aplcomm.jhuapl.edu

Introduction

Currently, little medical data is available in a form that can be used for large-scale comparative analyses. However, we expect this to change as current practice improves and we move from computer databases largely oriented towards billing to databases aimed at recording and improving the patient's care and health. We expect this new data to be encoded in a reasonably uniform fashion using the standard vocabularies developed by the industry and the medical informatics community with the National Library of Medicine.

The pervasive preparation of such patient data will allow one to aggregate them meaningfully and to compare individual records with averages or, more generally, with average templates (care maps) formed from subsets of the data. The databases will be provided by many vendors and use many different internal technologies. However, information interchange will be enabled by the use of standard interface and transport protocols. National standards will probably require federal mandates, but one can nevertheless expect standardization within large hospital groupings. These would be large HMOs, consortia including those of Health Science centers, and Medicare and Medicaid providers. These individual datasets will be large enough to support significant comparative analysis activities whose complexity and size will demand HPCC technologies.

A set of individual patient records will be the basis of this care-oriented dataset. This will then generate other databases such as those used in billing. Further, one would link the patient database with auxiliary databases used to define such items as hospital facilities and procedures. We anticipate some 100 million and eventually many more patient records. For example, 100 text pages of information for each of 100 million patients, at about a kilobyte per text page, gives a full database size of 10 terabytes. The databases will have growing amounts of image and video information, which will imply major storage and processing requirements. Note that we expect that the information and entertainment component of the future National Information Infrastructure, whose size will be measured in petabytes, will largely consist of video data. This can be contrasted with medical databases, which will probably consist of a relatively larger fraction of images and text. These large-scale national databases with uniform patient-oriented medical data will be produced in an evolutionary fashion. We can expect them to become a significant component of medical information systems over the next ten years. Hence they appear to be a very suitable target for NSF basic research leading to deployable systems on that time scale.
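The size estimate quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text (100 pages per patient, about a kilobyte per page, 100 million patients) and assumes decimal units, so that one terabyte is 10^12 bytes; the constant and function names are illustrative.

```python
# Back-of-envelope check of the text-only database size quoted above.
# Assumption: decimal units (1 kilobyte = 10**3 bytes, 1 terabyte = 10**12 bytes).
BYTES_PER_PAGE = 10**3       # "one text page is about a kilobyte"
PAGES_PER_PATIENT = 100
PATIENTS = 100 * 10**6       # 100 million patient records


def text_database_bytes(patients: int = PATIENTS,
                        pages_per_patient: int = PAGES_PER_PATIENT,
                        bytes_per_page: int = BYTES_PER_PAGE) -> int:
    """Total size in bytes of the text portion of the patient database."""
    return patients * pages_per_patient * bytes_per_page


if __name__ == "__main__":
    total = text_database_bytes()
    print(f"{total / 10**12:.0f} terabytes")  # prints "10 terabytes"
```

Image and video content is not included in this figure, so it is best read as a lower bound on storage.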

Health Care Applications

Imagine a set of longitudinal patient records recording the care and health of every patient. These records will be most interesting for analysis if they are large and of national scope, so as to allow statistically meaningful comparative analysis for clinical care as well as for health science and clinical research. A critical application of this database is the identification of medically distinct models or templates for diagnosis and care maps. Although there are important unresolved policy and practice issues on data sharing, it is likely that either federal mandates or natural medical collaboration will generate such major medical databases of national scope. This data, although accessible in a uniform fashion, will be stored nationwide in geographically distinct, distributed, heterogeneous databases residing on a rich variety of computers. This distributed heterogeneous characteristic is shared with the general National Information Infrastructure. However, a key complicating feature of medical data is the richness of the internal structure and the correlations between features, both within and between records.

Medical Informatics Research Issues

Functionalities needed in the Preparation of the Distributed Medical Database

Patient records are very complex and subject to data entry error. Quality control in the data entry function is very important and will be aided by an automatic comparison of each new entry with the expectations of suitable averages from the existing national database. This will flag a possibly incorrect entry and allow operator correction.
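One simple way to picture this kind of automatic check is sketched below. It is a minimal illustration, not a description of any deployed system: it assumes the "expectation" is the mean of previously accepted values for one numeric attribute, and that an entry is suspicious when it lies more than a chosen number of standard deviations from that mean. The function name, the threshold of 3.0 and the sample blood pressure values are all hypothetical.

```python
from statistics import mean, stdev


def flag_entry(value: float, reference: list[float], threshold: float = 3.0) -> bool:
    """Return True if `value` is far from what the reference data would predict.

    "Far" is assumed to mean more than `threshold` sample standard deviations
    from the reference mean; a real system would likely use richer models.
    """
    if len(reference) < 2:
        return False  # not enough data to form an expectation
    mu = mean(reference)
    sigma = stdev(reference)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical systolic blood pressure readings already in the database.
    accepted = [118.0, 122.0, 125.0, 130.0, 115.0, 121.0, 127.0]
    print(flag_entry(1200.0, accepted))  # True: probably a keying error for 120
    print(flag_entry(124.0, accepted))   # False: consistent with earlier entries
```

A flagged entry would be returned to the operator for confirmation rather than rejected outright, since genuine outliers do occur.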

As patients move around, it will be very helpful to build a complete medical record for individuals by computer-aided matching of the record fragments stored in different databases, usually belonging to different care providers. This function has interesting analogies with the matching of genome strands now done during the creation of the Human Genome database.
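The matching step can be illustrated with approximate string comparison, in the same spirit as sequence matching. The sketch below is an assumption-laden toy: it treats two fragments as belonging to the same patient when the birth dates agree exactly and the normalized names are sufficiently similar according to Python's standard difflib.SequenceMatcher. The Fragment fields, the normalization rule and the 0.85 cutoff are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Fragment:
    source: str      # which provider's database holds this fragment
    name: str        # patient name as entered by that provider
    birth_date: str  # assumed ISO format, e.g. "1950-03-14"


def normalize(name: str) -> str:
    """Lower-case the name, drop commas and sort its parts so word order is ignored."""
    return " ".join(sorted(name.lower().replace(",", " ").split()))


def likely_same_patient(a: Fragment, b: Fragment, min_similarity: float = 0.85) -> bool:
    """Heuristic match: identical birth date and similar normalized names."""
    if a.birth_date != b.birth_date:
        return False
    similarity = SequenceMatcher(None, normalize(a.name), normalize(b.name)).ratio()
    return similarity >= min_similarity


if __name__ == "__main__":
    clinic = Fragment("clinic", "Smith, John", "1950-03-14")
    hospital = Fragment("hospital", "John Smith", "1950-03-14")
    print(likely_same_patient(clinic, hospital))  # True
```

In practice more attributes would be compared, and borderline matches would be referred to a person, which is what "computer-aided" suggests.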

Typical Functionalities needed in Use and Analysis of the Distributed Medical Database

Detection of anomalies or outlying entries in either individual records or, more usually, dynamic groupings of records. This function can be used in approaches to fraud detection and the identification of epidemics, as in the "Colorado Vignette" presented at the meeting. The fraud capability has already been demonstrated by Booz Allen and Hamilton, and similar techniques can be used to detect credit card and other financial fraud.

Segmentation of medical data into typical models or templates. This is analogous to the market segmentation problem, where HPCC is already being used by the mail order industry.

Comparison of individual patients with templates to aid diagnosis and establish canonical care maps.

Analysis of the effectiveness of particular care plans including the results of deliberate or accidental deviation from the recommended care.
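The template functions listed above (forming templates from subsets of records and comparing an individual patient with them) can be sketched as follows. This is a deliberately simple illustration under stated assumptions: a template is taken to be the per-attribute average of a group of records, records are sparse dictionaries of numeric attributes that have already been scaled to comparable ranges, and similarity is root-mean-square difference over the attributes a record and a template share. All names are hypothetical.

```python
import math


def build_template(records: list[dict[str, float]]) -> dict[str, float]:
    """Average each attribute over the records that actually report it."""
    sums: dict[str, float] = {}
    counts: dict[str, int] = {}
    for record in records:
        for key, value in record.items():
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def distance(record: dict[str, float], template: dict[str, float]) -> float:
    """Root-mean-square difference over shared attributes (infinite if none are shared)."""
    shared = record.keys() & template.keys()
    if not shared:
        return math.inf
    return math.sqrt(sum((record[k] - template[k]) ** 2 for k in shared) / len(shared))


def closest_template(record: dict[str, float],
                     templates: dict[str, dict[str, float]]) -> str:
    """Name of the template with the smallest distance to the record."""
    return min(templates, key=lambda name: distance(record, templates[name]))


if __name__ == "__main__":
    templates = {
        "group_a": build_template([{"x": 1.0, "y": 1.0}, {"x": 3.0, "y": 1.0}]),
        "group_b": build_template([{"x": 10.0, "y": 10.0}]),
    }
    print(closest_template({"x": 2.5}, templates))  # group_a
```

How the groups are chosen in the first place is the segmentation problem itself, which this sketch leaves open.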

HPCC Issues in Database Systems and Datamining

The patient record database scenario and the selection of interesting analysis functionalities suggest many important computer science issues. Many of these would be best tackled by a collaboration between computer science and medical informatics researchers.

The architecture and systems issues for the distributed dynamic sparse irregular patient database: Information is continuously updated; each patient has a sparse selection of the many possible attributes, and the different records are very heterogeneous or irregular. Study of the relevance of relational, object and hybrid database structures is included here.
Aggregation of distributed databases for global studies: There is a range of choices. At one extreme is pure Web technology, with knowledge agents accessing the database, which itself stays put in distributed form. The other extreme involves occasional collection of all needed information into a central aggregated database, which is then mined. Intermediate solutions correspond to generalizations of caching strategies familiar in parallel and distributed computing.
Security and privacy: These are critical issues in many parts of the health care problem and indeed in the full National Information Infrastructure. Datamining and extraction of templates raise a special need: namely, the protection of individuals whose medical data is used to form a rare template from which it may be possible to identify the contributing individual records.
Benchmarks and collaborations: Although there are important general research issues, we believe it would be very helpful to identify some benchmark problems. These can be used to quantify and motivate the proposed health care database research. This use of "real" medical records will be needed even for the generic precompetitive systems we expect to be built as part of NSF research programs. Further, we note that some of the proposed research areas would benefit from links to other (national challenge) areas. For instance, the distributed database system and searching issues have much in common with the challenges of building the World Wide Web or, more generally, the National Information Infrastructure. Several of the datamining tools are applicable to a wide range of business problems.
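The rare-template concern raised under security and privacy above can be made concrete with a small sketch. One commonly used safeguard, offered here as an assumption rather than as the approach proposed in this report, is to withhold any template built from fewer than a minimum number of patient records. The function name, the input layout (patient identifier mapped to template name) and the minimum group size of 5 are hypothetical.

```python
from collections import Counter


def releasable_templates(assignments: dict[str, str],
                         min_group_size: int = 5) -> dict[str, int]:
    """Return template name -> contributing record count, omitting small groups.

    `assignments` maps a patient identifier to the template that patient's
    record contributed to. Templates with fewer than `min_group_size`
    contributors are withheld because they might point back to individuals.
    """
    counts = Counter(assignments.values())
    return {name: n for name, n in counts.items() if n >= min_group_size}


if __name__ == "__main__":
    # Hypothetical assignment of eight patients to two templates.
    assignments = {f"p{i}": "common" for i in range(6)}
    assignments.update({"p6": "rare", "p7": "rare"})
    print(releasable_templates(assignments))  # {'common': 6}
```

Suppressing small groups does not by itself guarantee privacy, since combinations of released templates can still leak information, which is one reason the issue is listed as a research topic.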