Shouldn’t This Be Easier By Now?
Eventually, someone in Information Technology or Database Administration gets asked to extract data from a PHI-rich line-of-business system or data warehouse but deliver it as de-identified data. Almost any data extraction approach allows data to be masked, redacted, suppressed or even randomized in some way. This type of functionality gives us data that is de-identified but often useless for testing, analytics or development.
Since my company, The EDI Project™, was founded in 2001, we have been asked many times to de-identify or anonymize data for testing and development work. Each time, we have written custom code for the project. That code is never transferable to another customer environment and must be redone for every scenario. If we were doing this every time, we thought, there had to be other companies with the same problem.
It turns out there are tools on the market that extract data from a line-of-business system or data warehouse and anonymize it so that it stays useful, rather than being de-identified into useless “John Doe” records.
For example, one of the largest integration engines on the market offers this functionality as a $250,000 add-on to their existing, very expensive suite of products. It is complicated to learn and use, and it needs custom code whenever multiple systems have to be anonymized the same way (e.g., enrollment, eligibility and claims data must have matching but anonymized names and dates of birth).
There are other tools in this space that sniff out vast data stores for PHI and attempt to automagically de-identify the data. Usually this is a masking or redaction approach, but even when it is not, many fields are marked as “suspect PHI” and left for human review. I can’t blame them, either. While Patient Name or Date of Birth fields are easy enough to identify, free-form fields can be a nightmare. Either way, these tools are usually very expensive and often leave the job half done.
There are a lot of cases where certain files, like EDI 837 claims, or maybe an enrollment database have to be de-identified for a test system. Perhaps it is an ongoing extract of data from a data warehouse for an analytics study. This is where, most of the time, the work is either not done (exemption granted) or custom code is deployed (expensive and time consuming). But technology is supposed to be faster, better and cheaper, isn’t it?
Since we are the guys who are often asked to do the work, we looked at our experience extracting health care data to design a tool we would want to use. No compromises. We wanted it to be easy to learn and use, and powerful enough to handle big data environments without becoming a bottleneck to any extraction work. Finally, it had to anonymize data across multiple sources so that the matching but de-identified data maintained record integrity (i.e., all the records for one patient in the PHI data sources had corresponding records in the de-identified data sources). Oh yeah, and since the main project being done is already expensive enough, the tool should be inexpensive.
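To make the record integrity requirement concrete, here is a minimal Python sketch of one common way to get matching surrogates across sources: derive them from a keyed hash (HMAC) of the original value. This is an illustration under our own assumptions, not a description of how any particular product does it. The key, the surrogate name lists and the function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; a real deployment would store it outside the code.
SECRET_KEY = b"example-key-not-for-production"

# Tiny illustrative surrogate lists (assumptions, not real reference data).
FIRST_NAMES = ["Alex", "Jordan", "Morgan", "Riley", "Casey", "Taylor"]
LAST_NAMES = ["Rivera", "Chen", "Okafor", "Novak", "Haddad", "Larsen"]


def _normalize(value: str) -> str:
    # Ignore case and stray whitespace so "a123 " and "A123" match.
    return " ".join(value.upper().split())


def _digest(*parts: str) -> bytes:
    # Keyed hash: the same input always yields the same digest.
    message = "|".join(_normalize(p) for p in parts).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).digest()


def pseudonymize_member_id(member_id: str) -> str:
    # Surrogate ID that is stable across every source using the same key.
    return "M" + _digest(member_id).hex()[:12].upper()


def pseudonymize_name(first: str, last: str) -> tuple[str, str]:
    # Pick realistic-looking surrogate names deterministically.
    d = _digest(first, last)
    return FIRST_NAMES[d[0] % len(FIRST_NAMES)], LAST_NAMES[d[1] % len(LAST_NAMES)]


# The same patient appearing in two different sources.
enrollment_row = {"member_id": "A123", "first": "Jane", "last": "Smith"}
claim_row = {"member_id": " a123 ", "first": "JANE", "last": "Smith "}

same_id = (pseudonymize_member_id(enrollment_row["member_id"])
           == pseudonymize_member_id(claim_row["member_id"]))
print("IDs still match after anonymization:", same_id)
```

Because the surrogate depends only on the key and the original value, the enrollment and claims extracts can be anonymized separately and still join on the surrogate member ID. With name lists this small, different patients will often share a surrogate name, so the ID is the field to join on.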
People have been using ETL (Extract, Transform, Load) tools for decades and are familiar with how they work. Thinking about the “T” in “Transform”, a common thing to do would be to change a date from MMDDYYYY format to DDMMYYYY format. This type of common transformation logic doesn’t have to be rewritten every time you extract from a new source. The integrator just picks it from a list when doing mapping work. Anonymizing PHI should be that simple as well.
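As a rough sketch of what “pick it from a list” means, the Python below registers the date transformation once under a name and lets a field mapping refer to it. The registry, the transformation names and the record layout are assumptions made up for this example.

```python
from datetime import datetime


def mmddyyyy_to_ddmmyyyy(value: str) -> str:
    # Parse as month-day-year, then write back as day-month-year.
    return datetime.strptime(value, "%m%d%Y").strftime("%d%m%Y")


# Hypothetical list of reusable transformations, keyed by name.
TRANSFORMS = {
    "date_mdy_to_dmy": mmddyyyy_to_ddmmyyyy,
    "uppercase": str.upper,
    "trim": str.strip,
}


def apply_mapping(record: dict, mapping: dict) -> dict:
    # mapping says which named transformation to apply to which field.
    out = dict(record)
    for field, transform_name in mapping.items():
        out[field] = TRANSFORMS[transform_name](record[field])
    return out


row = {"member_id": "A123", "dob": "12312001", "last": "smith"}
mapped = apply_mapping(row, {"dob": "date_mdy_to_dmy", "last": "uppercase"})
print(mapped)
```

The mapping names the transformation instead of re-implementing it, so a new source only needs a new mapping.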
Functions and drop-downs need to be available to anonymize every kind of PHI and handle it according to the special properties of that type of data. Names are anonymized differently than zip codes. More specifically, a Date of Birth (DOB) is anonymized differently than a Date of Service (DOS). The software should already know that; it should not have to be defined by the integration team or a subject matter expert.
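To illustrate type-aware handling, here is a small Python sketch in which each kind of field gets its own rule. The specific rules are our assumptions for the example only: a date of birth is shifted by up to 180 days so age stays roughly right, dates of service are shifted by one fixed per-patient offset so the gaps between visits are preserved, and zip codes keep only their first three digits. The key and all names are hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical secret; a real deployment would store it outside the code.
SECRET_KEY = b"example-key-not-for-production"


def _offset_days(patient_key: str, salt: str, max_days: int) -> int:
    # Deterministic per-patient offset in the range [-max_days, +max_days].
    message = f"{salt}|{patient_key}".encode("utf-8")
    d = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
    return int.from_bytes(d[:4], "big") % (2 * max_days + 1) - max_days


def anonymize_dob(dob: date, patient_key: str) -> date:
    # Assumed rule: move the birth date, but keep the age approximately right.
    return dob + timedelta(days=_offset_days(patient_key, "dob", 180))


def anonymize_dos(dos: date, patient_key: str) -> date:
    # Assumed rule: one offset per patient, so visit intervals are unchanged.
    return dos + timedelta(days=_offset_days(patient_key, "dos", 30))


def anonymize_zip(zip_code: str, patient_key: str = "") -> str:
    # Assumed rule: keep the first three digits, blank out the rest.
    return zip_code[:3] + "00"


# The tool, not the integrator, knows which rule belongs to which type.
ANONYMIZERS = {"dob": anonymize_dob, "dos": anonymize_dos, "zip": anonymize_zip}


def anonymize_field(field_type: str, value, patient_key: str):
    return ANONYMIZERS[field_type](value, patient_key)


visit_1 = anonymize_field("dos", date(2020, 1, 10), "P1")
visit_2 = anonymize_field("dos", date(2020, 3, 5), "P1")
print("Days between visits after anonymization:", (visit_2 - visit_1).days)
```

The integrator only says “this field is a DOB” or “this field is a DOS”; the choice of rule lives in the tool.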
As a result, we developed and launched our own Anonymization Engine called “Don’t Redact!™”. We’re integrators, so we built the tool an integrator would want in order to get this done quickly and easily. Someone with experience using integration tools can learn it in an afternoon, and your first sizeable anonymization effort can be deployed within a day or so of learning the ropes.
In the spirit of no compromises and disruptive technology, the Don’t Redact!™ Anonymization Engine is $25,000.
While The EDI Project™ is a professional services organization and we would be happy to deploy the software for you or set up your first live anonymized environment, the tool is well thought out and easy enough that you won’t need any services at all.
Want to find out more? http://theediproject.com/anonymization.html
Part 1: Minimum Necessary or Optional
Part 2: A False Choice. . .