The National Highway Traffic Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?
Practical data science for journalists (and everyone else)
If you know some Python and have dabbled in data, we're here for you! Let's add a dash of machine learning and a sprinkling of stats to your skillset.
And if you don't know Python, take this and this and call me in the morning. Or go all-in, maybe?
Topic walkthroughs
Practical, start-to-finish guides on data science concepts and tools. Not (too) boring, not (too) mathy, they're hopefully just what you're looking for.
Real-life examples
Theory on its own doesn't do much! Practice your skills by reproducing published, award-winning investigations.
(The ones that didn't win "real" awards win an award called "I think this project is pretty neat")
Reference materials
Most of the time we spend "programming" is finding things to cut and paste from the internet. Might as well put it all in one place, right? Somewhat-organized snippets to make our days go faster.
Topics we cover
There's more than one way to dice this onion, but here's a broad overview. You might also be interested in our topics list.
Regression (aka "how X affects Y")
Learn what you really mean when you wonder if two things are "correlated."
- Unemployment and life expectancy from the Associated Press
- Machine bias from ProPublica
- More from Dallas Morning News, Reveal, APM Reports, and others
Text analysis
From counting words to the terrors of sentiment analysis, you'll be covered.
- "Cut and paste" legislation from USA Today/Arizona Central
- Democratic candidate topics from Bloomberg
- More from New York Times and others
Classification
No time to look at 100,000 things? Teach computers to automatically classify documents, crimes, airplanes, or anything else!
- Misclassified crimes from LA Times
- Finding spyplanes from BuzzFeed
- More from The Washington Post, Atlanta Journal-Constitution, and others
Published reproductions
Data science and machine learning can be used anywhere! From a small visualization at The Upshot to a year-long investigation by Reveal, let's try to put these new skills in context.
Using machine learning as an investigative tool to cast light on years of underreporting by the Los Angeles Police Department.
A word-count analysis of the names of around 4500 museums in China.
After downloading over a hundred thousand reviews of "random chat apps," how to find reports of bullying, racism, and unwanted sexual behavior.
How to comb through 100,000 disciplinary documents without reading each individual one.
Standard sentiment analysis scores a document on a positive-vs-negative scale. Using the Emotional Lexicon, though, you can add unique emotional measurements like anger, joy, surprise, or fear.
Special interest groups use model legislation to push their agendas in state government. How can we find bills based on these "cut and paste" models?
The comment period on the FCC's net neutrality decision was flooded with bots. See how one still-in-training data scientist tackled finding re-used comments.
In the wide field of Democratic presidential candidates, who cares about what topics and how do these topics change over time?
What does Trump tweet about? An analysis of over 11,000 tweets.
Combine geographically granular life expectancy data with the American Community Survey to see how poverty, education, income, and demographics can affect a community.
p-values and the quest for "statistical significance"
An analysis of the relationship between race and city sanitation services in Milwaukee.
Some schools in Texas had an odd jump in standardized test scores between different grades. Was it cheating? Linear regression is on the case!
Using race, income, and other data to predict the performance of schools in Pinellas County, Florida. Along with a linear regression-driven critique.
Reproducing a research paper on the impact of weight on car accidents, along with a look at a state-based car crash database.
Looking at differences in access to advanced classes between schools with wealthy students and schools with poor students.
A classic piece of data journalism analyzing ticketing by Massachusetts police, and whether the race or gender of the driver might change the outcome.
A giant dataset of standardized policing data across different states.
From a list of points along a flight's path, how can you say "this looks like a surveillance plane?" And once you've found them, what do you do with the results?
Based on government-mandated data collection on mortgage granting, are certain banks or areas discriminatory in their lending practices?
When selecting a jury, both the defense and the prosecution are allowed to strike potential jurors from the pool. While the potential jurors provide answers to a questionnaire, what kind of role might race play in their selection or rejection?
In U.S. immigration courts, are certain judges and locations more likely to approve or deny claims of asylum?
An analysis of the presidential pardon process, but also a look into what to do when a very personal dataset doesn't exist.
Can an algorithm be racist? An examination of the COMPAS algorithm used as an aid in making sentencing and parole decisions. Also featuring a critique of the critique!
Government agencies seem to fulfill or reject FOIA requests without rhyme or reason. Can a journalist use machine learning to improve their chances?
Data snippets library
This doesn't need a section, but it'll feel left out if everything else gets one.
Vectorizing text
Slicing and dicing text, mostly focused on scikit-learn's vectorizers. Includes lots of tweaks for stemming, n-grams, and more.
Text analysis
Topic modeling, clustering, and other tools of the natural language processing trade.
Regression
Code snippets for performing linear and logistic regression in statsmodels, along with techniques to use and abuse the "formula" method of writing regressions.
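As a rough idea of the "formula" method, here's a small sketch of a linear regression in statsmodels. The numbers and the column names (`unemployment`, `life_expectancy`) are invented for illustration and don't come from any real analysis.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up numbers for illustration only
df = pd.DataFrame({
    "unemployment": [3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "life_expectancy": [81.0, 80.2, 79.1, 78.3, 77.0, 76.4],
})

# The formula reads as "life_expectancy explained by unemployment"
model = smf.ols("life_expectancy ~ unemployment", data=df)
results = model.fit()

# The coefficient estimates how much life expectancy changes
# for each one-point increase in unemployment
slope = results.params["unemployment"]

print(results.params)
```

For a yes/no outcome, `smf.logit` takes the same kind of formula and gives you a logistic regression instead.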
Classification
Cut-and-paste-ready code to leverage scikit-learn's classifiers, and other related tasks like feature importance and confusion matrices.
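Here's a minimal sketch of that workflow: train a classifier, then check it with a confusion matrix and feature importances. The data is synthetic (generated by scikit-learn's `make_classification`), standing in for real hand-labeled records, and a random forest is just one of many classifiers you could pick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data standing in for real, hand-labeled records
X, y = make_classification(
    n_samples=200,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    random_state=42,
)

# Hold back a quarter of the rows to test on
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Rows are the true labels, columns are the predicted labels
predictions = clf.predict(X_test)
matrix = confusion_matrix(y_test, predictions, labels=[0, 1])

# One score per feature; the scores add up to 1
importances = clf.feature_importances_

print(matrix)
print(importances)
```

The off-diagonal cells of the confusion matrix are the mistakes, so they're usually the first place to look when deciding whether a classifier is good enough to generate leads.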