Building Doctective
Databases
Doctective uses databases available through John Snow Labs to find whether doctors in your area have received payments from pharmaceutical companies. It uses the 2015 CMS Open Payments Database, the 2016 NPPES Provider Database, and the National Drug Code Directory to determine, respectively, payments made near a location, information on specific physicians, and drugs manufactured by companies that sponsored hospitals or physicians.
Difficulties & Data Retrieval
We originally intended to analyze this issue at the level of individual physicians, but most payments are disclosed to a hospital street address, not a PO box. The data also contained inconsistencies in spelling, abbreviations, ZIP code format, and more. We had to normalize the data to remove these errors, strip unnecessary information, and load the result into Redis.
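A minimal sketch of what this kind of normalization can look like is shown here. The abbreviation table and the function names `normalize_zip` and `normalize_address` are hypothetical, chosen for illustration; the rules we actually applied were more extensive.

```python
import re

# Hypothetical abbreviation map for illustration only.
STREET_ABBREVIATIONS = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "BOULEVARD": "BLVD",
    "ROAD": "RD",
    "DRIVE": "DR",
}


def normalize_zip(raw):
    """Return a 5-digit ZIP string, or None if no ZIP can be recovered."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 9:
        # ZIP+4, with or without the hyphen: keep the first five digits.
        digits = digits[:5]
    elif len(digits) in (3, 4):
        # Leading zeros were probably dropped (e.g. by a spreadsheet export).
        digits = digits.zfill(5)
    return digits if len(digits) == 5 else None


def normalize_address(raw):
    """Uppercase, drop punctuation, collapse whitespace, abbreviate street types."""
    cleaned = re.sub(r"[^\w\s]", "", raw.upper())
    words = [STREET_ABBREVIATIONS.get(word, word) for word in cleaned.split()]
    return " ".join(words)
```

With rules like these, "123 Main Street." and "123 main st" both become "123 MAIN ST", so the two spellings land under the same key when the cleaned records are loaded into the store.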
Additionally, the Open Payments data is intentionally somewhat anonymized by the government. By law, the data is provided without physicians' NPI (National Provider Identifier), which makes it difficult to correlate with other databases, all of which identify physicians by NPI. However, only ~10% of transactions from pharmaceutical companies go directly to physicians; the vast majority go to hospitals themselves. Address reporting discrepancies, and the time required to filter through them, prevented us from returning pharmaceutical payments at the hospital level, so we settled on ZIP code instead. The ZIP code was occasionally in the wrong column, but even without column information we could identify it by its distinctive format.
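Finding a misplaced ZIP code by its format can be sketched roughly as follows. This is an illustration rather than our exact code: the function name `find_zip` and the `expected_index` parameter are assumptions, and a pattern this simple can also match other five-digit fields, so real data needs extra checks.

```python
import re

# Five digits, optionally followed by a hyphen and four more (ZIP+4).
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def find_zip(row, expected_index):
    """Return the 5-digit ZIP found in a row, or None.

    The column where the ZIP is supposed to be is checked first; every
    other column is then scanned in order as a fallback.
    """
    candidates = []
    if 0 <= expected_index < len(row):
        candidates.append(row[expected_index])
    candidates += [f for i, f in enumerate(row) if i != expected_index]

    for field in candidates:
        value = field.strip()
        if ZIP_PATTERN.fullmatch(value):
            return value[:5]
    return None
```

For a row such as `["ACME PHARMA", "1 EXAMPLE ST", "BOSTON", "MA", "02139-4307"]` with the ZIP expected at index 3, the function skips "MA" and picks up the ZIP from the last column.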
Aggregation and Reordering Data
Data for a given ZIP code, for example, were spread across millions of lines. We had to reorder a large quantity of data, and were often limited by how much our computers could hold in memory.
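One way to work within a memory limit like this is to stream the file and keep only a running total per ZIP code, instead of holding every row at once. The sketch below shows that idea and is not necessarily the approach we took; the function name and the column names `zip` and `amount` are hypothetical.

```python
import csv
from collections import defaultdict
from decimal import Decimal


def total_payments_by_zip(lines, zip_field="zip", amount_field="amount"):
    """Sum payment amounts per ZIP code, reading one row at a time.

    `lines` can be an open file or any iterable of CSV lines with a header
    row. Memory use grows with the number of distinct ZIP codes, not with
    the number of rows.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for record in csv.DictReader(lines):
        totals[record[zip_field]] += Decimal(record[amount_field])
    return dict(totals)
```

Passing an open file handle, for example `with open(path, newline="") as f: total_payments_by_zip(f)`, keeps only one row in memory at a time. `Decimal` is used rather than `float` so that dollar amounts add up exactly.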