The National Highway Traffic Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?
Journalism + data science projects
Practical, real-world examples of machine learning and statistics used in journalism.
Using machine learning as an investigative tool to cast light on years of underreporting by the Los Angeles Police Department.
A word-count analysis of the names of around 4500 museums in China.
How to find reports of bullying, racism, and unwanted sexual behavior in over a hundred thousand downloaded reviews of "random chat apps."
How to comb through 100,000 disciplinary documents without reading each one.
Standard sentiment analysis scores a document on a positive-vs-negative scale. Using the Emotional Lexicon, though, you can add unique emotional measurements like anger, joy, surprise, or fear.
Special interest groups use model legislation to push their agendas in state government. How can we find bills based on these "cut and paste" models?
The comment period on the FCC's net neutrality decision was flooded with bots. See how one still-in-training data scientist tackled finding re-used comments.
In the wide field of Democratic presidential candidates, who cares about what topics and how do these topics change over time?
What does Trump tweet about? An analysis of over 11,000 tweets.
Combine geographically granular life expectancy data with the American Community Survey to see how poverty, education, income, and demographics can affect a community.
p-values and the quest for "statistical significance"
An analysis of the relationship between race and city sanitation services in Milwaukee.
Some schools in Texas had an odd jump in standardized test scores between different grades. Was it cheating? Linear regression is on the case!
Using race, income, and other data to predict the performance of schools in Pinellas County, Florida. Along with a linear regression-driven critique.
Reproducing a research paper on the impact of weight on car accidents, along with a look at a state-based car crash database.
Looking at differences in access to advanced classes between schools with wealthy students and schools with poor students.
A classic piece of data journalism analyzing ticketing by Massachusetts police, and whether the race or gender of the driver might change the outcome.
A giant dataset of standardized policing data across different states.
From a list of points along a flight's path, how can you say "this looks like a surveillance plane?" And once you've found them, what do you do with the results?
Based on government-mandated data collection on mortgage granting, are certain banks or areas discriminatory in their lending practices?
When selecting a jury, both the defense and the prosecution are allowed to strike potential jurors from the pool. While the potential jurors provide answers to a questionnaire, what kind of role might race play in their selection or rejection?
In U.S. immigration courts, are certain judges and locations more likely to approve or deny claims of asylum?
An analysis of the presidential pardon process, but also a look into what to do when a very personal dataset doesn't exist.
Can an algorithm be racist? An examination of the COMPAS algorithm used as an aid in making sentencing and parole decisions. Also featuring a critique of the critique!
Government agencies seem to fulfill or reject FOIA requests without rhyme or reason. Can a journalist use machine learning to improve their chances?