Practical data science for journalists (and everyone else)

If you know some Python and have dabbled in data, we're here for you! Let's add a dash of machine learning and a sprinkling of stats to your skillset.

And if you don't know Python, take this and this and call me in the morning. Or go all-in, maybe?

Topic walkthroughs

Practical, start-to-finish guides on data science concepts and tools. Not (too) boring and not (too) mathy, they're hopefully just what you're looking for.

See our topics guide

Real-life examples

Theory on its own doesn't do much! Practice your skills by reproducing published, award-winning investigations.

(The ones that didn't win "real" awards win an award called "I think this project is pretty neat.")

See our projects page

Reference materials

Most of the time we spend "programming" is finding things to cut and paste from the internet. Might as well put it all in one place, right? Somewhat-organized snippets to make our days go faster.

Check our references

Topics we cover

There's more than one way to dice this onion, but here's a broad overview. You might also be interested in our topics list.

Regression (aka "how X affects Y")

Learn what you really mean when you wonder if two things are "correlated."

  • Unemployment and life expectancy from the Associated Press
  • Machine bias from ProPublica
  • More from Dallas Morning News, Reveal, APM Reports, and others

Get started with regression now →

Text analysis

From counting words to the terrors of sentiment analysis, you'll be covered.

  • "Cut and paste" legislation from USA Today/The Arizona Republic
  • Democratic candidate topics from Bloomberg
  • More from New York Times and others

Get started with text analysis now →

Classification

No time to look at 100,000 things? Teach computers to automatically classify documents, crimes, airplanes, or anything else!

  • Misclassified crimes from LA Times
  • Finding spyplanes from BuzzFeed
  • More from The Washington Post, Atlanta Journal-Constitution, and others

Get started with classification now →

Published reproductions

Data science and machine learning can be used anywhere! From a small visualization at The Upshot to a year-long investigation by Reveal, let's try to put these new skills in context.

The National Highway Traffic Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?

Standard sentiment analysis scores a document on a positive-vs-negative scale. Using the Emotional Lexicon, though, you can add unique emotional measurements like anger, joy, surprise, or fear.
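
As a rough sketch of the idea, lexicon-based scoring boils down to looking up each word and tallying the emotions attached to it. The tiny `TOY_LEXICON` below and the `emotion_counts` helper are made up for illustration; a real emotion lexicon is far larger, and the actual project may tokenize and normalize text differently.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon, for illustration only. A real emotion lexicon
# maps thousands of words to one or more emotions.
TOY_LEXICON = {
    "furious": {"anger"},
    "attack": {"anger", "fear"},
    "afraid": {"fear"},
    "delighted": {"joy"},
    "celebrate": {"joy"},
    "sudden": {"surprise"},
}

def emotion_counts(text, lexicon=TOY_LEXICON):
    """Count how many words in `text` are tagged with each emotion."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        # Words missing from the lexicon contribute nothing
        for emotion in lexicon.get(word, ()):
            counts[emotion] += 1
    return counts

sample = "Residents were afraid after the sudden attack, and some were furious."
scores = emotion_counts(sample)
print(dict(scores))
```

Raw counts favor long documents, so in practice you'd probably divide by the number of words before comparing one text to another.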

Combine geographically granular life expectancy data with the American Community Survey to see how poverty, education, income, and demographics can affect a community.

From a list of points along a flight's path, how can you say "this looks like a surveillance plane?" And once you've found them, what do you do with the results?

When selecting a jury, both the defense and the prosecution are allowed to strike potential jurors from the pool. Potential jurors answer a questionnaire, but what role might race play in their selection or rejection?

Government agencies seem to fulfill or reject FOIA requests without rhyme or reason. Can a journalist use machine learning to improve their chances?

Data snippets library

This doesn't need a section, but it'll feel left out if everything else gets one.

Vectorizing text

Slicing and dicing text, mostly focused on scikit-learn's vectorizers. Includes lots of tweaks for stemming, n-grams, and more.
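
If "vectorizing" sounds mysterious, here's a plain-Python sketch of what a count-based vectorizer does under the hood: split text into words, optionally glue neighbors into n-grams, then count. This is not scikit-learn's implementation, and the function names (`tokenize`, `ngrams`, `count_vectorize`) are hypothetical, but the resulting table of counts has the same general shape.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_vectorize(docs, max_n=2):
    """Turn documents into rows of counts over a shared, sorted vocabulary."""
    doc_counts = []
    for doc in docs:
        tokens = tokenize(doc)
        features = []
        for n in range(1, max_n + 1):
            features.extend(ngrams(tokens, n))
        doc_counts.append(Counter(features))
    # One column per distinct term seen anywhere in the collection
    vocabulary = sorted(set().union(*doc_counts))
    matrix = [[counts[term] for term in vocabulary] for counts in doc_counts]
    return vocabulary, matrix

docs = ["The bill passed", "The bill failed, the bill died"]
vocabulary, matrix = count_vectorize(docs)

for term, column in zip(vocabulary, zip(*matrix)):
    print(f"{term:12} {column}")
```

Each row of `matrix` is one document and each column is one term, which is the form most classifiers and clustering tools expect.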

See the snippets

Text analysis

Topic modeling, clustering, and other tools of the natural language processing trade.

See the snippets

Regression

Code snippets for performing linear and logistic regression in statsmodels, along with techniques to use and abuse the "formula" method of writing regressions.
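
For a feel of what a one-variable linear regression is actually computing, here's a minimal least-squares fit in plain Python. It is not statsmodels, the `fit_line` helper is hypothetical, and the numbers are invented for illustration; a formula like `life_expectancy ~ unemployment` describes the same kind of relationship.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, relative to how much x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data, for illustration only
unemployment = [3.0, 5.0, 7.0, 9.0]
life_expectancy = [80.0, 79.0, 78.0, 77.0]

slope, intercept = fit_line(unemployment, life_expectancy)
print(f"Each extra point of unemployment goes with {slope:+.2f} years")
```

The slope is the "how X affects Y" number: the change in the outcome that goes along with a one-unit change in the predictor.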

See the snippets

Classification

Cut-and-paste-ready code to leverage scikit-learn's classifiers, along with related tasks like feature importance and confusion matrices.
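
A confusion matrix is just a tally of what the classifier guessed versus what was actually true. Here's a small plain-Python sketch with made-up labels; the `confusion_matrix` function below is our own hypothetical helper, though it follows the same layout scikit-learn uses (rows are true labels, columns are predictions).

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are the true labels, columns are the predicted labels."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(true, pred)] for pred in labels] for true in labels]

# Made-up labels, for illustration only
actual = ["serious", "serious", "minor", "minor", "minor", "serious"]
predicted = ["serious", "minor", "minor", "minor", "serious", "serious"]
labels = ["minor", "serious"]

matrix = confusion_matrix(actual, predicted, labels)

# The diagonal holds the cases the classifier got right
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(actual)

for label, row in zip(labels, matrix):
    print(f"{label:8} {row}")
```

The off-diagonal cells are usually where the story is: they show which kind of mistake is being made, not just how often.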

See the snippets