Uncovering abusive doctors that were allowed to continue practicing

How to comb through 100,000 disciplinary documents without reading each one.

logistic regression, text analysis, classification, natural language processing

Readings and links

Doctors & Sex Abuse, the project homepage
License to betray, the first installment of the series
About the investigation
Behind the scenes

Summary

The Atlanta Journal-Constitution obtained over 100,000 disciplinary documents for doctors through scraping and FOIA requests. They needed to narrow them down to only those about sex-related offenses, but reading 100,000 individual documents would have taken far too much time.

This project was similar to both the NYT airbags and the Washington Post app store investigations. The classifier itself is similar to the simplistic reproduction, in which Jeff Ernsthausen selected which words to include in the classifier and which to leave out.

Jeff had to continually tweak the classifier based on both positive and negative features. For example, "breast" could cause a document to be classified as a sex-related disciplinary document, but it could also appear in a phrase like "breast cancer." By adding "breast cancer" as its own term, he was able to cut down on accidentally flagged documents.
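To make the "breast" versus "breast cancer" idea concrete, here is a minimal sketch of hand-picked terms acting as positive and negative features. Everything specific in it is an assumption for illustration: the term list, the weights, the intercept, and the two sample sentences are made up and are not the AJC's actual word list or model. In a real project the weights would be learned by fitting a logistic regression on hand-labeled documents (for example with a library such as scikit-learn) rather than typed in by hand.

```python
import math
import re

# Hypothetical hand-picked terms; NOT the word list used in the investigation.
TERMS = ["breast", "breast cancer", "sexual", "inappropriate"]

# Made-up weights purely for illustration. A fitted logistic regression
# would learn these from labeled documents.
WEIGHTS = {
    "breast": 1.5,          # positive feature: pushes toward "sex-related"
    "breast cancer": -3.0,  # negative feature: cancels out the "breast" signal
    "sexual": 2.5,
    "inappropriate": 1.0,
}
INTERCEPT = -2.0  # also made up


def count_terms(text, terms=TERMS):
    """Count whole-word, case-insensitive occurrences of each term."""
    lowered = text.lower()
    return {
        term: len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        for term in terms
    }


def score(text):
    """Turn weighted term counts into a probability with the logistic function."""
    counts = count_terms(text)
    z = INTERCEPT + sum(WEIGHTS[term] * n for term, n in counts.items())
    return 1 / (1 + math.exp(-z))


# Invented example sentences, not quotes from real disciplinary documents.
docs = [
    "The physician touched the patient's breast in an inappropriate, sexual manner.",
    "The physician failed to order follow-up imaging for suspected breast cancer.",
]
scores = [score(doc) for doc in docs]

for doc, s in zip(docs, scores):
    print(f"{s:.2f}  {doc}")
```

Note that the second sentence still counts once for "breast" and once for "breast cancer"; it is the negative weight on the longer phrase that pulls the score back down, which is the effect described above.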

If you're interested in projects like this, I recommend taking a look at the "I don't want to read thousands of documents" section of the topics page.

About the site

Hi, I'm Soma, welcome to Data Science for Journalism a.k.a. investigate.ai!

There's been a lot of buzz about machine learning and "artificial intelligence" being used in stories over the past few years. It's mostly not that complicated - a little stats, a classifier here or there - but it's hard to know where to start without a little help.

If you know a little Python programming, hopefully this site can be that help! Learn more about this project here.

Thanks to Columbia Journalism School, the Knight Foundation, and many others.
