Errors in clinical research databases are common but relatively little is known about their characteristics and optimal detection and prevention strategies. We have analyzed data from several clinical research databases at a single academic medical center to assess frequency, distribution and features of data entry errors.
Error rates detected by the double-entry method ranged from 2.3% to 26.9%. Errors were due both to mistakes in data entry and to misinterpretation of the information in the original documents. Error detection based on data constraint failure significantly underestimated total error rates, and constraint-based alarms integrated into the database appear to prevent only a small fraction of errors. Many errors were non-random, organized in spatial and cognitive clusters, and some could potentially affect the interpretation of the study results. Further investigation is needed into the methods for detection and prevention of data errors in research.
However, errors of direct care are not the only ones that can harm patients. Errors in clinical research, if large enough to affect the investigators’ conclusions, can have much greater impact on clinical outcomes by swaying the standard of care of thousands of patients [5]. In fact, a number of reports have shown that errors are common in clinical research databases [6–9]. Nevertheless, relatively little is known about the types of errors in research databases, their characteristics and possible effects on research conclusions. We therefore undertook this project to examine prevalence and features of apparent errors in several clinical research databases.
Materials & Methods
Dataset
We analyzed the data from several research databases that contained information about treatment and outcomes of oncologic patients who underwent radiation treatment at a single academic medical center. The databases used an MS Access client and a PostgreSQL database server. The standard MS Access forms graphical user interface was used for data entry. All data in these databases were entered manually by trained technicians, usually copied from electronic or paper medical records. Range constraints specific to each parameter and dynamic constraints based on values in other fields were used to minimize data entry errors. Individuals who entered specific records were not tracked.
A typical record contained the patient’s demographic information, date of diagnosis of their condition (defined as the date of biopsy), dates of the initial and final outpatient radiation treatment visits, date of the last follow-up visit (after the radiation treatment course had been completed), and current follow-up status (remission, relapsed, deceased from the treated cancer, deceased from other causes).
We employed two strategies for identifying erroneous entries: detecting highly improbable or internally inconsistent data, and detecting discrepancies between duplicate entries of the same data in different databases (externally inconsistent data).
Impossible / Internally Inconsistent Data
To evaluate data in research databases for impossible entries and internal inconsistencies, we analyzed two databases (subsequently referred to as “B” and “S”) that contained data on treatment and outcomes of oncologic patients. Both databases contained similar data fields. However, while database B primarily contained information on patients who were diagnosed at the same hospital, database S contained a substantial fraction of patients who were diagnosed elsewhere and were subsequently referred for treatment.
In each of these databases we evaluated data for the following impossible conditions:
- Date of diagnosis falls on a Sunday (date of diagnosis was defined as the date of the biopsy, and biopsies are not normally conducted on weekends)
- Date of the first radiation treatment falls on a Sunday (radiation treatments are usually only administered Monday through Friday)
- Date of the last radiation treatment falls on a Sunday
- Date of the last follow-up visit falls on a Sunday
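The Sunday checks listed above can be sketched in a few lines. This is only an illustration, not the procedure used in the study: the field names (`diagnosis`, `first_treatment`, `last_treatment`, `last_follow_up`) and the example record are hypothetical, since the actual column names are not given.

```python
from datetime import date

# Hypothetical field names; the actual database columns are not described.
DATE_FIELDS = ("diagnosis", "first_treatment", "last_treatment", "last_follow_up")

def sunday_fields(record):
    """Return the names of the date fields in `record` that fall on a Sunday."""
    # date.weekday() numbers Monday as 0, so Sunday is 6.
    return [
        name
        for name in DATE_FIELDS
        if record.get(name) is not None and record[name].weekday() == 6
    ]

# Made-up example record.
record = {
    "diagnosis": date(2008, 3, 2),          # a Sunday
    "first_treatment": date(2008, 3, 10),   # a Monday
    "last_treatment": date(2008, 4, 18),    # a Friday
    "last_follow_up": None,                 # missing values are skipped
}
print(sunday_fields(record))
```

A check of this kind flags improbable entries without access to the source documents, but it cannot detect a wrong date that happens to fall on a weekday.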
We also analyzed the number of data entries that triggered data integrity alarms incorporated into the databases. The alarms were triggered by the following impossible conditions:
- Date of Diagnosis (database B only): triggered by date of diagnosis > date of the pathology report, date of diagnosis > date of initiation of chemotherapy, date of diagnosis > date of relapse, date of diagnosis > date of the last follow-up appointment.
- Date of the first radiation treatment (both databases): triggered if < date of diagnosis, > date of last follow-up, > date of last treatment, > 3 months before the date of the last treatment (database B only: courses of radiation treatment for patients included in that database cannot be longer than 3 months)
- Date of the last follow-up visit: triggered if < date of entry.
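As a rough sketch of how dynamic constraints of this kind might be expressed, the function below reproduces the alarm conditions listed for the date of the first radiation treatment. The field names are hypothetical, and "3 months" is approximated as 92 days, which is an assumption; the databases' actual rule is not specified.

```python
from datetime import date, timedelta

# Assumption: "3 months" is approximated as 92 days.
MAX_COURSE = timedelta(days=92)

def first_treatment_alarms(rec, limit_course_length=False):
    """Return the alarm messages triggered by the first-treatment date.

    `limit_course_length` enables the course-length rule, which the text
    describes as applying to database B only.
    """
    alarms = []
    first = rec["first_treatment"]
    if first < rec["diagnosis"]:
        alarms.append("first treatment before diagnosis")
    if first > rec["last_follow_up"]:
        alarms.append("first treatment after last follow-up")
    if first > rec["last_treatment"]:
        alarms.append("first treatment after last treatment")
    if limit_course_length and rec["last_treatment"] - first > MAX_COURSE:
        alarms.append("course longer than 3 months")
    return alarms

# Made-up example record with a first-treatment date that precedes diagnosis.
rec = {
    "diagnosis": date(2008, 3, 3),
    "first_treatment": date(2008, 2, 25),
    "last_treatment": date(2008, 6, 20),
    "last_follow_up": date(2009, 1, 15),
}
print(first_treatment_alarms(rec))
print(first_treatment_alarms(rec, limit_course_length=True))
```

Constraints like these only catch values that violate an ordering or range rule; a plausible but incorrect date passes silently.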
For both of these databases we also assessed internal consistency of the data, using as an example the concordance between the fields containing vital status and relapse status. These fields were considered internally inconsistent if vital status was recorded as “deceased from the cancer” but no relapse was documented for a patient who was known to have gone into remission after conclusion of the initial course of treatment.
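The vital status and relapse rule can be written as a single predicate. The sketch below assumes hypothetical field names (`vital_status`, `achieved_remission`, `relapse_documented`); the actual encoding of these fields is not described.

```python
def inconsistent_vital_status(rec):
    """True if death from the cancer is recorded after remission with no relapse."""
    return (
        rec["vital_status"] == "deceased from the cancer"
        and rec["achieved_remission"]
        and not rec["relapse_documented"]
    )

# Made-up records for illustration.
records = [
    {"vital_status": "deceased from the cancer",
     "achieved_remission": True, "relapse_documented": False},   # inconsistent
    {"vital_status": "deceased from the cancer",
     "achieved_remission": True, "relapse_documented": True},    # consistent
    {"vital_status": "remission",
     "achieved_remission": True, "relapse_documented": False},   # consistent
]
flagged = [r for r in records if inconsistent_vital_status(r)]
print(len(flagged), "of", len(records), "records flagged")
```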
Externally Inconsistent Data
To analyze the data in research databases for external inconsistencies, we analyzed 1,006 patient records that were incidentally entered in two different databases (subsequently referred to as “P1” and “P2”) at the same time. We analyzed the discrepancies between the records of the same patients in the two databases in the following fields: medical record number (MRN), date of birth (DOB), first and last name, number of treatment sessions, and the dates of the first and last treatment session. All of the demographic information fields were entered on one screen in both databases, and all of the information related to treatment was entered on another screen.
In addition to analyzing discrepancies between individual fields in the two databases, we also analyzed concordance between discrepancies in the fields entered on the same screen and the fields entered on different screens.
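One way to tabulate whether discrepancies in two fields tend to occur together is a 2×2 table of discrepancy indicators, built once for a pair of fields from the same screen and once for a pair from different screens. The sketch below is an assumption about how such a table could be built, not the authors' code. It presumes the P1 and P2 records have already been matched by patient, and the field names and values are invented.

```python
def is_discrepant(rec1, rec2, field):
    """True if the two databases disagree on `field` for the same patient."""
    return rec1[field] != rec2[field]

def cooccurrence_table(pairs, field_a, field_b):
    """2x2 counts of discrepancies.

    Rows: field_a discrepant (row 0) or not (row 1).
    Columns: field_b discrepant (column 0) or not (column 1).
    """
    table = [[0, 0], [0, 0]]
    for rec1, rec2 in pairs:
        row = 0 if is_discrepant(rec1, rec2, field_a) else 1
        col = 0 if is_discrepant(rec1, rec2, field_b) else 1
        table[row][col] += 1
    return table

# Made-up matched record pairs (P1 record, P2 record).
p1 = {"dob": "1950-01-02", "last_name": "Smith", "n_sessions": 30}
pairs = [
    (p1, {"dob": "1950-02-01", "last_name": "Smyth", "n_sessions": 30}),
    (p1, {"dob": "1950-01-02", "last_name": "Smith", "n_sessions": 33}),
    (p1, dict(p1)),
]
same_screen = cooccurrence_table(pairs, "dob", "last_name")      # both demographic
cross_screen = cooccurrence_table(pairs, "dob", "n_sessions")    # different screens
print(same_screen, cross_screen)
```

Each resulting table has the 2×2 shape to which Fisher's Exact Test applies.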
To demonstrate a potential effect of errors in research data, we also compared two datasets on local tumor recurrence in 133 patients that had been entered independently by two physicians. We assessed the differences in time to recurrence derived from each of these two datasets (which should have been completely identical).
Statistical Analysis
The binomial distribution was used to calculate exact 95% confidence limits for error frequencies. Fisher’s Exact Test was used for analysis of 2×2 tables. Survival curves were compared using a log-rank test. All analyses were performed using the SAS software program (Version 8.1; SAS, Cary, NC). All statistical tests were 2-sided.
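For readers who want to reproduce calculations of this kind without SAS, the following standard-library sketch computes an exact (Clopper-Pearson) binomial confidence interval by bisection and a two-sided Fisher's exact p-value by summing hypergeometric probabilities. It is not the SAS procedure used in the study. The function names are illustrative, the two-sided rule (summing all tables no more probable than the observed one) is one common convention, and the direct use of `math.comb` is only adequate for modest sample sizes.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _bisect(f, iterations=60):
    """Root of a function f that increases from negative to positive on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k errors observed in n entries."""
    # Lower limit: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper limit: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect(
        lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    total = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum every table that is no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))

# Illustrative numbers only, not results from the study.
print(exact_ci(3, 100))
print(fisher_exact_two_sided(3, 1, 1, 3))
```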
IRB
The study protocol was reviewed and approved by Partners Human Research Committee.