Uncovering abusive doctors who were allowed to continue practicing

How to comb through 100,000 disciplinary documents without reading each one individually.

logistic regression, text analysis, classification, natural language processing

The Atlanta Journal-Constitution obtained over 100,000 disciplinary documents for doctors through scraping and FOIA requests. They needed to narrow them down to only those about sex-related offenses, but reading 100,000 individual documents would have taken far too much time.

This project was similar to both the NYT airbags and the Washington Post app store investigations. The classifier itself is similar to the simplistic reproduction: Jeff Ernsthausen selected certain words to include in or exclude from the classifier.

Jeff had to continually tweak the classifier by adjusting both positive and negative features. For example, "breast" could flag a document as sex-related, but it could also appear in a phrase like "breast cancer." By adding "breast cancer" as its own term, he was able to cut down on accidentally flagged documents.
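To make the "breast" versus "breast cancer" idea concrete, here is a minimal sketch in plain Python. It is not the AJC's actual code: the term list, the six example sentences, and the helper names (`featurize`, `train`, `flag_probability`) are all invented for illustration, and the tiny gradient-descent logistic regression stands in for whatever library the real project used. It counts a handful of hand-picked terms in each document and lets the model learn a weight for each one.

```python
import math
import re

# Hypothetical hand-picked vocabulary (the real term list is not given here).
TERMS = ["sexual", "breast", "breast cancer", "inappropriate"]

def featurize(text):
    """Count whole-word occurrences of each hand-picked term."""
    lowered = text.lower()
    return [
        len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        for term in TERMS
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, labels, lr=0.1, epochs=1000):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the score
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def flag_probability(text, weights, bias):
    """Estimated probability that a document is sex-related."""
    x = featurize(text)
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Invented toy examples: 1 = sex-related, 0 = not sex-related.
examples = [
    ("The physician touched the patient's breast in a sexual manner.", 1),
    ("Respondent made inappropriate sexual comments to a patient.", 1),
    ("The doctor fondled the patient's breast during an exam.", 1),
    ("Respondent failed to diagnose breast cancer in a timely manner.", 0),
    ("Breast cancer screening results were not documented.", 0),
    ("Respondent prescribed opioids without adequate records.", 0),
]

rows = [featurize(text) for text, _ in examples]
labels = [label for _, label in examples]
weights, bias = train(rows, labels)

# Inspect what the model learned for each term.
for term, weight in zip(TERMS, weights):
    print(f"{term!r}: {weight:+.2f}")
```

In this sketch a document that mentions "breast cancer" still counts toward "breast", but the separate "breast cancer" feature picks up a negative weight that offsets it. That mirrors the tweak described above: the negative term does not remove the positive one, it counteracts it.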

If you're interested in projects like this, I recommend taking a look at the "I don't want to read thousands of documents" section of the topics page.