Practical data science for journalists (and everyone else)

If you know some Python and have dabbled in data, we're here for you! Let's add a dash of machine learning and a sprinkling of stats to your skillset.

And if you don't know Python, take this and this and call me in the morning. Or go all-in, maybe?

Topic walkthroughs

Practical, start-to-finish guides on data science concepts and tools. Not (too) boring, not (too) mathy, they're hopefully just what you're looking for.

See our topics guide

Real-life examples

Theory on its own doesn't do much! Practice your skills by reproducing published, award-winning investigations.

(The ones that didn't win "real" awards win an award called "I think this project is pretty neat")

See our projects page

Reference materials

Most of the time we spend "programming" is finding things to cut and paste from the internet. Might as well put it all in one place, right? Somewhat-organized snippets to make our days go faster.

Check our references

Topics we cover

There's more than one way to dice this onion, but here's a broad overview. You might also be interested in our topics list.

Regression (aka "how X affects Y")

Learn what you really mean when you wonder if two things are "correlated."

Unemployment and life expectancy from the Associated Press
Machine bias from ProPublica
More from Dallas Morning News, Reveal, APM Reports, and others

Get started with regression now →

Text analysis

From counting words to the terrors of sentiment analysis, you'll be covered.

"Cut and paste" legislation from USA Today/Arizona Central
Democratic candidate topics from Bloomberg
More from New York Times and others

Get started with text analysis now →

Classification

No time to look at 100,000 things? Teach computers to automatically classify documents, crimes, airplanes, or anything else!

Misclassified crimes from LA Times
Finding spyplanes from BuzzFeed
More from The Washington Post, Atlanta Journal-Constitution, and others

Get started with classification now →

Published reproductions

Data science and machine learning can be used anywhere! From a small visualization at The Upshot to a year-long investigation by Reveal, let's try to put these new skills in context.

Searching for faulty airbags in vehicle complaints

The National Highway Traffic Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?

The New York Times

Building a crime classification engine

Using machine learning as an investigative tool to cast light on years of underreporting by the Los Angeles Police Department.

Los Angeles Times

Chinese museum analysis

A word-count analysis of the names of around 4500 museums in China.

Caixin

Analyzing online safety through app store reviews

After downloading over a hundred thousand reviews of "random chat apps," how do you find reports of bullying, racism, and unwanted sexual behavior?

The Washington Post

Uncovering abusive doctors who were allowed to continue practicing

How to comb through 100,000 disciplinary documents without reading each individual one.

Atlanta Journal-Constitution

Analyzing the tone of Trump's speeches

Standard sentiment analysis scores a document on a positive-vs-negative scale. Using the Emotional Lexicon, though, you can add unique emotional measurements like anger, joy, surprise, or fear.

The New York Times

Detecting special interest model legislation in state laws

Special interest groups use model legislation to push their agendas in state government. How can we find bills based on these "cut and paste" models?

USA Today, The Arizona Republic, and the Center for Public Integrity

Detecting bots in FCC comment submissions

The comment period on the FCC's net neutrality decision was flooded with bots. See how one still-in-training data scientist tackled finding re-used comments.

Figuring out what Democratic candidates care about

In the wide field of Democratic presidential candidates, who cares about what topics and how do these topics change over time?

Bloomberg

What does Trump tweet about?

An analysis of over 11,000 tweets.

The New York Times

Examining life expectancy at the local level

Combine geographically granular life expectancy data with the American Community Survey to see how poverty, education, income, and demographics can affect a community.

The Associated Press

p-values and p-hacking

p-values and the quest for "statistical significance"

FiveThirtyEight

Predicting delays in patching potholes based on demographics

An analysis of the relationship between race and city sanitation services in Milwaukee.

Milwaukee Journal Sentinel

Finding cheating schools in Texas with linear regression

Some schools in Texas had an odd jump in standardized test scores between different grades. Was it cheating? Linear regression is on the case!

Dallas Morning News

Measuring the impact of re-segregation on Florida elementary schools

Using race, income, and other data to predict the performance of schools in Pinellas County, Florida, along with a linear regression-driven critique.

Tampa Bay Times

Analyzing whether larger cars cause more deadly crashes

Reproducing a research paper on the impact of weight on car accidents, along with a look at a state-based car crash database.

Review of Economic Studies

Tracking equal access to school programs

Looking at differences in access to advanced classes between schools with wealthy students and schools with poor students.

ProPublica

Investigating who gets a ticket and who gets a warning

A classic piece of data journalism analyzing ticketing by Massachusetts police, and whether the race or gender of the driver might change the outcome.

The Boston Globe

Stanford Open Policing Data

A giant dataset of standardized policing data across different states.

Uncovering surveillance planes with BuzzFeed

From a list of points along a flight's path, how can you say "this looks like a surveillance plane?" And once you've found them, what do you do with the results?

BuzzFeed News

Analyzing mortgage rejections for racial bias

Based on government-mandated data collection on mortgage granting, are certain banks or areas discriminatory in their lending practices?

Reveal

Bias in the jury selection process

When selecting a jury, both the defense and the prosecution are allowed to strike potential jurors from the pool. While the potential jurors provide answers to a questionnaire, what kind of role might race play in their selection or rejection?

APM Reports

Analyzing the impact of particular judges on the US asylum process

In U.S. immigration courts, are certain judges and locations more likely to approve or deny claims of asylum?

Reuters

Investigating who receives presidential pardons

An analysis of the presidential pardon process, but also a look into what to do when a very personal dataset doesn't exist.

ProPublica

An analysis of racial bias in criminal sentencing

Can an algorithm be racist? An examination of the COMPAS algorithm used as an aid in making sentencing and parole decisions. Also featuring a critique of the critique!

ProPublica

Predicting FOIA request success rates

Government agencies seem to fulfill or reject FOIA requests without rhyme or reason. Can a journalist use machine learning to improve their chances?

data.world

Data snippets library

This doesn't need a section, but it'll feel left out if everything else gets one.

Vectorizing text

Slicing and dicing text, mostly focused on scikit-learn's vectorizers. Includes lots of tweaks for stemming, n-grams, and more.

See the snippets
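
For a taste of what "vectorizing" means, here's a minimal sketch using scikit-learn's CountVectorizer with n-grams. The two sentences are made up for illustration, and the settings are just one reasonable choice, not necessarily what the snippets themselves use.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example sentences, not from any real complaint database
texts = [
    "The airbag exploded",
    "The airbag failed",
]

# ngram_range=(1, 2) counts single words AND two-word phrases
vectorizer = CountVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(texts)

# One row per document, one column per word or phrase
counts = matrix.toarray()

# vocabulary_ maps each word or phrase to its column number
for term, column in sorted(vectorizer.vocabulary_.items()):
    print(term, counts[:, column])
```

Each row is a document and each column is a word or phrase, which is the shape most classifiers and clustering tools expect.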

Text analysis

Topic modeling, clustering, and other tools of the natural language processing trade.

See the snippets

Regression

Code snippets for performing linear and logistic regression in statsmodels, along with techniques to use and abuse the "formula" method of writing regressions.

See the snippets
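
If you haven't seen the "formula" method before, here's a rough sketch of a linear regression in statsmodels. The numbers and column names are invented for illustration, so don't read anything into the result.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented numbers, purely for illustration
df = pd.DataFrame({
    "unemployment": [3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "life_expectancy": [81.0, 80.2, 79.1, 78.3, 77.0, 76.4],
})

# The formula reads as "life_expectancy is explained by unemployment"
model = smf.ols("life_expectancy ~ unemployment", data=df)
results = model.fit()

# The coefficient on unemployment is the estimated change in
# life expectancy for each one-point increase in unemployment
slope = results.params["unemployment"]
print(results.params)
```

Swapping smf.ols for smf.logit gives you a logistic regression with the same formula style, as long as the outcome column is 0 or 1.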

Classification

Cut-and-paste-ready code to leverage scikit-learn's classifiers, plus related tasks like feature importance and confusion matrices.

See the snippets
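
As a small sketch of how those pieces fit together, here's a random forest trained on synthetic data, followed by a confusion matrix and feature importances. The dataset comes from scikit-learn's make_classification, so it stands in for real documents or crimes, and the choice of classifier is an assumption for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 4 numeric features, 2 classes
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold back a quarter of the rows to check the classifier's work
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# Rows are the true labels, columns are the predicted labels
matrix = confusion_matrix(y_test, predictions, labels=[0, 1])
print(matrix)

# How much each feature contributed to the forest's decisions
importances = clf.feature_importances_
print(importances)
```

The off-diagonal cells of the confusion matrix are the mistakes, which is usually where the reporting starts.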

About the site

Hi, I'm Soma, welcome to Data Science for Journalism a.k.a. investigate.ai!

There's been a lot of buzz about machine learning and "artificial intelligence" being used in stories over the past few years. It's mostly not that complicated - a little stats, a classifier here or there - but it's hard to know where to start without a little help.

If you know a little Python programming, hopefully this site can be that help! Learn more about this project here.

Thanks to Columbia Journalism School, the Knight Foundation, and many others.