A Whitepaper by Dr. Nancy J Stark
Let's define a mid-sized device study as 100 subjects, 50 pages of paper case report forms per subject, and an average of 10 fields per page. Device sponsors are always surprised to hear that data management for such a study will require in the neighborhood of 2500+ hours. So what's all the time about?
How it is supposed to work
In a perfect world, data management would happen by the seamless convergence of data collection and database design. I'll discuss a paper-based model first, because that's what most start-ups use and it is easier to understand. Then I'll look at some electronic data capture (EDC) variations.
Data collection begins with defining what data to collect in order to support what you want to show. In other words, start with your envisioned indications for use, then determine how to measure device performance to that indication; the parameters you measure are called the endpoints. 'Endpoint' is an old term that used to mean the end of data collection. Today it means the pivotal data point that demonstrates product success or failure.
You progress to a set of case report forms that are easy for the Form Filler (at the study site) to fill out and error-proof for the Data Enterer (at the data management center) to enter. The case report forms should collect all the data that are necessary to support or refute the indication for use under study.
The case report form data are verified against the source files (medical records) by a monitor who is independent of the Form Filler. When the sponsor and investigator are separate parties, the monitor works for the sponsor and independence from the Form Filler is assumed. When the study is sponsored by the investigator (an investigator-sponsored study), the monitoring function should be outsourced to a third party who has no financial interest in the outcome.
...Handling test reports
Much of the data will be presented in the form of reports from, say, gastroenterologists or interventional radiologists. Statisticians can't work easily with the comment data found in reports. You'll need to develop questions that capture the essential elements of these reports in order to have usable data for analysis. You might have a series of questions, such as: "Is there any sign of cancer: yes, no, unsure." "What is the widest dimension of the cancer (enter 0 for no or 99.9 for unsure): __.__ mm." etc.
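As a minimal sketch of how such a coded answer might be unpacked for analysis, the Python below splits the "widest dimension" value into a finding and a measurement. The function name and the returned keys are hypothetical; only the codes 0 and 99.9 come from the sample question above.

```python
NO_CANCER = 0.0   # form instruction: enter 0 when no cancer is seen
UNSURE = 99.9     # form instruction: enter 99.9 when the reader is unsure


def decode_widest_dimension(raw: float) -> dict:
    """Split the coded form value into a finding and an optional measurement."""
    if raw == NO_CANCER:
        return {"finding": "no", "widest_mm": None}
    if raw == UNSURE:
        return {"finding": "unsure", "widest_mm": None}
    if 0.0 < raw < UNSURE:
        return {"finding": "yes", "widest_mm": raw}
    # Anything else cannot be written in the __.__ field and is a data error.
    raise ValueError(f"value {raw} is outside the range the form allows")
```

Keeping the special codes out of the measurement column stops a 99.9 from being averaged in as if it were a real tumor size.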
...Other data sources
There will likely be other sources of data too. Digital data may be recorded directly from the device. These data might be impedance, frequency, milliseconds, watts, or other parameters of device performance. Or these data might be measurements taken from the subject such as resistance, oxygenation, pH, or blood pressure.
Clinical laboratory data may be recorded on a laboratory information management system (LIMS). These data may include analyte concentrations such as glucose, potassium, or cholesterol levels; enzyme levels such as alanine transaminase; or red, white, or plasma cell counts in the blood.
...Data dictionary and map
Next it is time to finalize the data dictionary and data map. The dictionary defines each data element. The definition may be self-evident, but sometimes a lengthier definition is required; for example, 'CT with contrast' might mean a CAT scan with barium contrast medium. The definition includes the data type and the range of accepted values, if any, and maps the data to its source. The dictionary and map aren't just for convenience. They make up part of the metadata that enables the reconstruction of events at a future time, should that be necessary.
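One way to picture a dictionary entry is as a small record holding the definition, data type, accepted range, and source. The sketch below is illustrative only: the class, the field names, the blood pressure range, and the CRF location are all assumptions, not taken from a real study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DictionaryEntry:
    name: str
    definition: str
    data_type: type
    source: str                       # the "map": where the value comes from
    minimum: Optional[float] = None   # lowest accepted value, if any
    maximum: Optional[float] = None   # highest accepted value, if any

    def accepts(self, value) -> bool:
        """True when the value has the declared type and lies within range."""
        if not isinstance(value, self.data_type):
            return False
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


# Hypothetical entry; the range and CRF location are made up for illustration.
systolic = DictionaryEntry(
    name="systolic_bp",
    definition="Systolic blood pressure, seated, in mmHg",
    data_type=int,
    source="CRF page 4, field 2",
    minimum=50,
    maximum=300,
)
```

With this entry, `systolic.accepts(120)` is true, while the text value `"120"` and the out-of-range value `400` are both refused.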
Using the correct data type is very important if you want calculations to run properly downstream. For example, to calculate the subject's age at the time of consent you would subtract the birth date from the consent date. If we erroneously enter the consent date as text, the calculation will return an error. It can be easily fixed, but it will probably cost half a day in labor.
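The age calculation can be sketched in Python as follows. The function name is hypothetical, and the check simply shows how a consent date held as text is caught before any arithmetic is attempted.

```python
from datetime import date


def age_at_consent(birth_date: date, consent_date: date) -> int:
    """Whole years between the birth date and the consent date."""
    if not isinstance(birth_date, date) or not isinstance(consent_date, date):
        raise TypeError("both values must be dates, not text")
    years = consent_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the consent year.
    if (consent_date.month, consent_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years
```

Called with two real dates, the function returns an integer age; called with the consent date as the text `"2010-06-07"`, it raises a `TypeError` instead of producing a wrong answer.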
Meanwhile, back at the data management center, the Database Designer designs a database uniquely built to receive the data from the various sources. The structure of the questions on the case report form dictates exactly the structure of the database. If the question says: "How many apples did you eat today?" the database must receive integer data. If the question says: "What was your primary source of protein for lunch? Select one," the database will have a drop-down box that allows the Data Enterer to select one option from a predetermined list. If the question says: "Which of the following vegetables did you eat today? Check all that apply," a related table will be designed that allows several answers to be linked back to a specific subject.
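The three question styles map onto three database structures. The sketch below uses Python's built-in SQLite module as a stand-in for whatever database is actually used; every table and column name is hypothetical, and the list of protein sources is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces links only when asked
conn.executescript("""
-- "Select one": the drop-down list is a lookup table of permitted answers.
CREATE TABLE protein_source (code TEXT PRIMARY KEY);
INSERT INTO protein_source VALUES ('meat'), ('fish'), ('beans'), ('dairy');

-- "How many apples": the column accepts non-negative integers only.
CREATE TABLE subject_day (
    subject_id    INTEGER PRIMARY KEY,
    apples_eaten  INTEGER NOT NULL
        CHECK (typeof(apples_eaten) = 'integer' AND apples_eaten >= 0),
    lunch_protein TEXT NOT NULL REFERENCES protein_source(code)
);

-- "Check all that apply": a related table, one row per answer per subject.
CREATE TABLE vegetable_eaten (
    subject_id INTEGER NOT NULL REFERENCES subject_day(subject_id),
    vegetable  TEXT NOT NULL,
    PRIMARY KEY (subject_id, vegetable)
);
""")

conn.execute("INSERT INTO subject_day VALUES (1, 3, 'fish')")
conn.executemany(
    "INSERT INTO vegetable_eaten VALUES (?, ?)",
    [(1, "carrot"), (1, "kale")],
)


def rejected(sql: str) -> bool:
    """Return True when the database refuses the statement."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False
```

In this sketch, an apple count entered as the word 'three' and a protein source that is not on the list are both refused at entry, which is the behavior the form structure calls for.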
In addition, remember the digital data that is coming in from the device? The database must be designed to accept these data when they are electronically imported and to automatically link the data to the correct subject. The same holds true for any clinical laboratory data that will be directly imported.
Finally, a set of management and outcomes reports will be coded. Management reports will enable managers and the sponsor to follow the progress of the study without revealing study outcomes. A typical list of manager's reports includes: data entered by subject; data queries by subject, by site, and by monitor; missing data; adverse events; and data received but not yet entered. One type of summary report is shown below. Here, each subject is represented by a row, and the case report forms make up the columns. The forms are color-coded to show if the data are entered (blue), missing (gray), outstanding due to queries (red), or received but not entered (yellow).
Outcomes reports might consist of, say, time to an event, number of events at day 14, correlation of diagnosis with the gold standard, or any other calculated result.
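A text-only version of such a subject-by-form grid might be sketched as follows. The one-letter codes stand in for the colors, and the function names, form names, and sample data are hypothetical.

```python
from collections import Counter

# One-letter stand-ins for the colors described in the text.
STATUS_CODES = {"entered": "E", "missing": "M", "query": "Q", "received": "R"}


def status_grid(form_status: dict, forms: list) -> str:
    """One row per subject, one column per case report form."""
    lines = ["subject  " + " ".join(f"{form:>6}" for form in forms)]
    for subject in sorted(form_status):
        row = form_status[subject]
        # A form with no recorded status is treated as missing.
        cells = " ".join(
            f"{STATUS_CODES[row.get(form, 'missing')]:>6}" for form in forms
        )
        lines.append(f"{subject:>7}  {cells}")
    return "\n".join(lines)


def status_counts(form_status: dict, forms: list) -> Counter:
    """Tally how many forms are in each status across all subjects."""
    return Counter(
        row.get(form, "missing")
        for row in form_status.values()
        for form in forms
    )
```

Because the grid reports only the status of each form, it lets managers follow progress without exposing any outcome data.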
To validate the database, three or four sets of faux patients are entered into the database. The phony data are entered, reports are run, and the answers are checked against manual calculations. In this way it can be confirmed that all the tables are synchronized, edit checks are firing properly, correct data types are being accepted as specified in the data dictionary and map (described above), calculated fields are functioning, and database queries are working.
The final touch to the database application is to invoke a version control system. Once the design of the database is validated, the Access or Excel file is 'checked in' to a library such as Microsoft SharePoint or SourceSafe. Then, whenever someone wants to make a change to the code, they check it out of the library, make their changes, and check it in again. The version control software will automatically increment the version number, record who made what changes, and provide a space for notes.
...Convergence: merging data and database application
After the database application is validated it is ready to receive the 5000 pages of case report form data from the mid-sized study described above. This involves entering 50,000 data points! In our perfect world, our easy-to-enter forms are forwarded to the data management center where the Data Enterer enters the data without error. The data may also be re-entered, verified, or reviewed by a second person to improve data quality.
Once all data have been entered and reviewed, edits run, and those very expensive queries resolved (see below), there should be no missing or illegible data and no outstanding data queries. The digital data from the device are forwarded and match exactly with the patient numbers when uploaded. Ditto for the laboratory data.
...Data audit trails and other Part 11 issues
As data are entered into the database an independent, but corresponding, record is kept of what data are entered or changed, when, and by whom. This creates an audit trail of the data (as opposed to version control of the database application). Affordable audit trail/Part 11 add-ons are commercially available for both Access and Excel (still the most popular application for data management). The audit trails will track every entry and every change to data fields, generating a complete history of data entry and data changes.
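The idea of an audit trail can be sketched in a few lines of Python. This is a toy illustration of the what/when/who record, not a Part 11 compliant implementation, and all class and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    timestamp: datetime   # when the value was entered or changed
    user: str             # by whom
    subject_id: int
    field_name: str
    old_value: object     # None on first entry
    new_value: object


class AuditedRecords:
    """Holds current values and an append-only history of every change."""

    def __init__(self):
        self._data = {}
        self._trail = []

    def set_value(self, user, subject_id, field_name, value):
        key = (subject_id, field_name)
        old = self._data.get(key)
        # Record the change before applying it; entries are never edited.
        self._trail.append(
            AuditEntry(datetime.now(timezone.utc), user, subject_id,
                       field_name, old, value)
        )
        self._data[key] = value

    def history(self, subject_id, field_name):
        """Every entry and change for one field, oldest first."""
        return [e for e in self._trail
                if e.subject_id == subject_id and e.field_name == field_name]
```

The trail is kept separately from the data values, so a correction never overwrites the record of what was originally entered.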
...Data validation and lock
Earlier we validated the database, confirming that the fields were the correct data type, that related tables were properly linked, and that reports were accurate. After data entry, we validate the data itself. I like to print a report that 'looks' like a case report form. I print out, say, 10% of the forms and manually compare them to the case report forms used for input. And I print out a report of endpoints. A common level of error acceptance is 0% for pivotal endpoints and 0.5% for other fields. Finally, the data can be locked and imported into SAS, reports can be run, statisticians can analyze, and all is right with the world.
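The acceptance check can be expressed as a small calculation. In the sketch below, the audited values are held in dictionaries keyed by (subject, field); the function names and data layout are assumptions, while the 0% and 0.5% thresholds come from the text.

```python
def error_rate(entered: dict, paper: dict, fields: set) -> float:
    """Fraction of audited fields whose database value differs from the paper form.

    Both dictionaries are keyed by (subject_id, field_name).
    """
    checked = [key for key in paper if key[1] in fields]
    if not checked:
        return 0.0
    errors = sum(1 for key in checked if entered.get(key) != paper[key])
    return errors / len(checked)


def passes_audit(entered: dict, paper: dict,
                 pivotal_fields: set, other_fields: set) -> bool:
    """Apply the acceptance levels: 0% for pivotal endpoints, 0.5% otherwise."""
    return (error_rate(entered, paper, pivotal_fields) == 0.0
            and error_rate(entered, paper, other_fields) <= 0.005)
```

Under these thresholds, a single discrepancy in 400 audited non-pivotal fields (0.25%) passes, while any discrepancy at all in a pivotal endpoint fails.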
Electronic Data Capture
Viewing it simply, in a paper-based system the data are transcribed onto case report forms by the Form Filler at the study site. After source verification, the data are forwarded to the data management center for data entry by a Data Enterer. In electronic data capture (EDC), the validated database is uploaded to a server and study staff are given access via a password and secure connection. Rather than filling out a paper form, the Form Filler enters the data directly into the database and so is also the Data Enterer. This is where you'll see cost savings: the work isn't duplicated, because the Form Filler/Data Enterer is entering data directly from the source files.
EDC systems are a worthwhile investment if your company expects to do many trials with similar designs and your investigators are comfortable with computers and applications.
How it usually works
Unfortunately, life is rarely a straightforward path, and we often begin in the middle. The things that can go wrong are too many to enumerate. But here are a few likely problem-makers.
It seems so minor, changing a question on the case report form so it makes more sense to the Form Filler. But if you change the data type, it might be hard to merge data for that question from before and after the change. Even worse, you are looking at a couple of hours of database re-design and re-verification.
...Differing subject ID numbers
How can it be? Subjects are numbered 1-100 in the database, but 20100607.1-20100607.100 from the laboratory. As humans, we can see it is a date followed by a number, but our database does not understand. Another half-day is lost learning there is a problem, identifying the problem, and fixing the problem.
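The fix usually amounts to a small translation rule. The sketch below assumes the laboratory IDs always have the shape shown above (an eight-digit date, a period, then the subject number); the pattern and the function name are hypothetical.

```python
import re

# Assumed shape of a laboratory ID: YYYYMMDD.subject, e.g. "20100607.100"
LAB_ID = re.compile(r"^(\d{8})\.(\d+)$")


def subject_number(lab_id: str) -> int:
    """Extract the study subject number from a laboratory ID held as text."""
    match = LAB_ID.match(lab_id)
    if match is None:
        raise ValueError(f"unrecognized laboratory ID: {lab_id!r}")
    return int(match.group(2))
```

Note that the laboratory ID must be imported as text. Read as a number, 20100607.100 and 20100607.1 become the same value, and subjects 1, 10, and 100 can no longer be told apart.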
Adverse event data can require a lot of time to manage. The event must be followed until it is resolved or stabilized, and that may not happen for several months. Ongoing adverse events keep the database open and require continuing updates for an indeterminate length of time. Ongoing reports to Data Monitoring Committees or Clinical Events Committees also demand extra time.
Unfortunately, some forms will come in with missing values. When a value is missing, i.e., a question is unanswered, the integrity of the data is compromised. If the missing value is a pivotal endpoint, the Form Filler will be asked by data query to track down the missing data. If the value is no longer available (let's say you wanted the patient's temperature immediately after the procedure), the statistician may be able to guess at a defensible value for the missing data, but it is still a guess.
Paper case report forms are filled out by hand. In spite of the best monitoring and source verification, some illogical or illegible data will slip through. "Now, let's see", thinks the Data Enterer, "is that a six or a zero?" A data query is sent to the monitor, who sends it to the Form Filler, who looks up the correct answer in the source file and sends it back to the monitor, who sends it to the Data Enterer. This is good for 15 minutes of Data Enterer time, 15-30 minutes of monitoring time, and 30 minutes of Form Filler time. Not counting the Form Filler, we can assume $125 per data query. If there is one data query per subject in our 100-subject study, that would be $12,500; at five data queries per subject, it would be $62,500!
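The query cost scales directly with the number of queries, so it is worth working through. The sketch below takes the $125 per query from the text; the function name is hypothetical, and the number of queries per subject is left as an input because it varies from study to study.

```python
def query_cost(subjects: int, queries_per_subject: float,
               cost_per_query: float = 125.0) -> float:
    """Total cost of resolving data queries, excluding Form Filler time."""
    return subjects * queries_per_subject * cost_per_query
```

For the 100-subject study, `query_cost(100, 5)` returns 62500.0, that is, 500 queries at $125 each.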
Thankfully, in the EDC world, illegible handwriting is not an issue for the data manager as the entry has already been reviewed by the site. Typically, in EDC studies, less time is spent dealing with handwriting and missing value queries and effort can be focused on more complex queries.
When data are coming in from multiple sources, they don't always arrive at the same time. You may have all the device data for subjects 1-37, the paper case report forms for subjects 1-30 except for subjects 3 and 23 which were held back, and no laboratory data. You'll need ongoing management reports to show what data have been received for which subjects, and what data the monitors should be tracking down at the site.
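A report of what is still outstanding from each source reduces to a set difference. The sketch below uses the example numbers from the paragraph above and assumes that subjects 1-37 are the ones expected so far; the function name and source labels are hypothetical.

```python
def outstanding_by_source(expected_subjects, received: dict) -> dict:
    """For each data source, list the expected subjects not yet received."""
    expected = set(expected_subjects)
    return {
        source: sorted(expected - set(subject_ids))
        for source, subject_ids in received.items()
    }


received = {
    "device": range(1, 38),                                      # subjects 1-37
    "paper_crf": [s for s in range(1, 31) if s not in (3, 23)],  # 3 and 23 held back
    "laboratory": [],                                            # nothing yet
}
outstanding = outstanding_by_source(range(1, 38), received)
```

Here `outstanding["paper_crf"]` lists subjects 3 and 23 followed by 31 through 37, which is exactly the list the monitors would take to the site.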
Throughout all of this, communication will be recorded by email messages. Messages should be written clearly and with detail. All phone conversations should be summarized with a list of action items. When referring to a database or spreadsheet, use its name and version number. Communication is not only meant for the immediate reader; it is also meant to be metadata (data about the data) that can serve to reconstruct events at a point in the future. For contractors this is especially important; billing issues are usually resolved based on written communications.
Periodically, all the messages are selected and printed (memo style) to Acrobat Writer. The long PDF file is bookmarked with relevant tables of contents, burned to a CD, and stored as a permanent part of the technical file.
Not just CDG's numbers
I used my own data, of course, but I also called upon the goodwill of several friends to assemble the hours estimate below. A perfectly designed, perfectly executed study will not come with free data management.
How Clinical Device Group can help
Building a database, validating the database, invoking audit trails, rendering the database Part 11 compliant, entering and importing data, validating the data, exporting the data for SAS analysis, and documenting communication are all part of the data management services that CDG offers. Whether you want a paper-based system or an electronic data capture system, CDG has the capabilities and experience to help you.
Do you need all of these services, only part of them, or consulting to help you do it yourself? You can contact Clinical Device Group for help. Here are some of the Network Staff and resources we have on hand to help you.
...Meet Wessam Sonbol, Database Designer
Wessam Sonbol is the President of Clinical Systems Consultants Inc. Mr. Sonbol graduated from the University of Minnesota’s Institute of Technology with a degree in computer science. Throughout his career he has held positions at Eminent Research/PPD, American Medical Systems and Sorin Group. He developed a global database with which he gathered data from hospitals all over the world to review cardiovascular diseases and understand the demographics of different causes. Wessam also developed http://www.tctmd.com, an informational website for doctors focusing on Trans Catheter Therapeutics; and Global View, a post-market eRegistry tool used by many organizations such as St. Jude and J&J. He also developed studies using Oracle Clinical, Clindex, ERT and Rave and designed an MS Access database compliant with 21 CFR Part 11.
...Meet Michelle Secic, Biostatistician
Michelle Secic is the President of Secic Statistical Consulting, Inc. where she works on a variety of medical research projects. She aids companies around the world with the statistical aspects of their research from small projects to large clinical trials involving drugs or medical devices. She received her Master of Science degree in applied statistics in 1990. She worked at the Cleveland Clinic Foundation from 1990 through 2001, when she left to continue expansion of her own consulting company. She authored a book in 1997 titled "How To Report Statistics in Medicine", published by the American College of Physicians. The book was translated into Chinese and published in China in 2002; the second edition was released in 2006.
...Meet Dr. Robert Thiel, Biostatistician
Robert Thiel, PhD is the President of Thiel Statistical Consultants. Dr. Thiel has degrees in physics, clinical counseling and psychometrics. He is an expert in the analysis of IVD data for the medical device industry, with more than 75 510(k) and 5 PMA approvals. He has published papers in the areas of free PSA, breast cancer, ovarian cancer, liver disease and multivariate analyses as applied to sports (baseball and basketball).
His current interests lie in the area of Bayesian statistics, bootstrap and permutation test methodology, and the application of these methods to small sample trials.
...Meet Clindex, a web-based EDC system
Clindex (Fortress Medical Systems LLC, MN) offers a Clinical Data Management System, Clinical Trial Management System, and EDC System in one integrated solution. The software is designed for use with medical devices and diagnostic products (and okay, with pharmaceutical trials; but those of you who know me know I don't do drugs....) Clindex is used by companies ranging from start-up firms to Fortune 500 companies. It has been used in hundreds of clinical studies, international studies, and studies that supported FDA marketing applications. The FDA has inspected numerous companies using Clindex with no feedback of any findings or deficiencies.
I look forward to hearing from you!
Nancy J Stark, PhD
President, Clinical Device Group Inc