Data science topics for journalists

If you're not quite sure where to start or what to do, let's break down some of your options.

Natural Language Processing

"Natural language processing" is the fancy-words version of "text stuff." It's a wide, wide category that you can take in a lot of directions.

Start by learning to count words, either the simple way or a more complex one. Once you're set with single words, triumph over multi-word phrases with the power of n-grams.
As long as you can split your documents into separate words, almost every NLP technique can be applied to any language. While Western languages can be separated using spaces, splitting apart words in East Asian languages takes an extra step.
Tame multiple forms of the same word - like swim, swimming and swims - with the powers of stemming and lemmatization.
Skip over overly simplistic airbag analysis and go right to either sentiment analysis or app reviews. Did automatically categorizing things pique your interest? Spend some more time in our classification section.
If you're battling the demons of scanned documents or PDFs, learn to convert PDFs, Word docs and web pages to text.
Shift to finding what documents are about with topic modeling and clustering, and see whether they help you figure out what Democratic candidates are talking about.
Once you're feeling especially powerful, take a chance on detecting special interest legislation.
Relax with a little more sentiment analysis, but a slightly more complex (and troubled, perhaps?) version.
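
To make the first two steps concrete, here is a minimal sketch of counting words and n-grams using only Python's standard library. The sample text and the `tokenize` and `ngrams` helpers are made up for illustration; they are not taken from any of the projects, which may well use other tools.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of letters and apostrophes. This naive
    # splitter only suits languages that separate words with spaces.
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text, invented for this example.
text = "The fish swims. The fish swam. The big fish swims away."

tokens = tokenize(text)
word_counts = Counter(tokens)
# Note: these bigrams ignore sentence boundaries, so "swims the" is counted.
bigram_counts = Counter(ngrams(tokens, 2))

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Notice that "swims" and "swam" are counted as different words here, which is exactly the problem stemming and lemmatization are meant to address.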

How X affects Y

Seeing how topics affect one another is the domain of regression. To the chagrin of mathematicians everywhere, we skip over the details and go right to the application. Be sure you check your findings with experts before you leap out with especially spicy takes!

Learn the basics of linear regression, then apply the "change in X, change in Y" concept with the Associated Press or the Milwaukee Journal-Sentinel.
Are these schools cheating? Are these schools failing? Use regression predictions for test-score-based investigations alongside the Dallas Morning News or The Tampa Bay Times.
Pivot from predicting numbers (linear regression) to predicting yes/no outputs with logistic regression. Try your hand at reproducing projects on bias in ticketing, mortgages, or jury selection.
Your capstone is ProPublica's COMPAS analysis.
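
As a rough illustration of the "compare predictions to reality" idea, the sketch below fits a straight line by ordinary least squares and then looks for the school whose actual score sits furthest from its prediction. The school names and scores are invented, and `fit_line` is a hypothetical helper written for this example; the newsroom analyses listed above are considerably more involved.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: average test scores for five made-up schools.
schools = ["School A", "School B", "School C", "School D", "School E"]
last_year = [60, 65, 70, 75, 80]
this_year = [62, 66, 71, 90, 81]

slope, intercept = fit_line(last_year, this_year)

# Residual = actual score minus the score the line predicts.
residuals = {
    name: actual - (slope * previous + intercept)
    for name, previous, actual in zip(schools, last_year, this_year)
}
biggest_surprise = max(residuals, key=lambda name: abs(residuals[name]))

print(f"Each extra point last year goes with {slope:.2f} points this year")
print(f"Largest gap between prediction and reality: {biggest_surprise}")
```

A large residual is a reason to make phone calls, not a finding in itself. The unusual school also pulls the fitted line toward itself, which is one more reason to check results with an expert.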

I don't want to read a thousand documents

Welcome to "I have a million documents but not enough interns to read them!" From app store reviews to medical board directives, teach a computer to do the hard work for you.

Start by learning the basics of counting words.
Automatically finding the topics inside is usually pretty useless, but give it a shot with topic modeling and clustering.
A better option is classification, where you annotate a handful and let the computer process the rest. Learn from others' mistakes with the NYT's airbag analysis, then level up to app reviews.
Finally, build a crime classifier to cover any other bases you might need.
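
The "annotate a handful, let the computer process the rest" workflow can be sketched with a tiny naive Bayes classifier written in plain Python. This is only one of many possible classifiers, and it is an assumption for illustration rather than the method used in the projects. The reviews, the labels, and the `train` and `classify` helpers are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    # Count how often each word appears under each hand-assigned label.
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from how common the label is, then add evidence per word.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen word/label pairs from zeroing out.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled reviews, invented for this example.
labeled = [
    ("app crashes every time i open it", "complaint"),
    ("keeps crashing and freezes constantly", "complaint"),
    ("love this app it works great", "praise"),
    ("great design and easy to use", "praise"),
]
model = train(labeled)

print(classify("the app crashes constantly", model))
print(classify("great app love it", model))
```

Four labeled examples is far too few for real work. In practice you would label many more, hold some back, and check how often the classifier gets the held-back ones wrong before trusting it with the rest.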

A full walkthrough

Want to go start to finish? Each project has a slightly different idea of what knowledge you bring to the table, so here is one potential path you could take:

Journalists love sentiment analysis, so I find it useful to start by showing the simplicity and shortfalls of machine learning through a very short introduction to sentiment analysis. This teaches simple text analysis as well as the basics of building a classification system.
After that, you can run through other text-based classification examples, such as the NYT Takata airbags and Los Angeles Times crime misclassification pieces. The Takata airbag piece is an excellent example of when you should just use a simple text search instead of ML, while the LA Times crime misclassification piece has a million and one editorial and ethical concerns.
You can make the "how do inputs relate to outputs?" jump to linear regression with the AP's regression on life expectancy, examining outputs and talking through what a p-value is worth. When you'd like to stop worrying about correlation vs causation, move onward to see the use of comparing predictions to reality with the Dallas Morning News' cheating scandal. Think about how to write about these topics and argue about what's publishable with the Milwaukee Journal-Sentinel piece on pothole fill times.
After you've had your fill of predicting numeric outcomes, bring things full circle back to classification and predicting categories using BuzzFeed's surveillance planes piece. And speaking of categories, what about logistic regression? It's time for odds ratios as you brace yourself for a barrage of stories on racism with a 2003 piece from The Boston Globe and a more recent Reveal series on mortgages.
In the end you can finish things up with a look at the dirtiness of data and process in Reuters' asylum analysis and the FOIA predictor, along with reverse engineering the opposition in ProPublica's COMPAS analysis.
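
Since odds ratios come up in the logistic regression step of this path, here is a small sketch of what one is, computed directly from counts. The numbers are invented and `odds_ratio` is a hypothetical helper. It does not reproduce any of the published analyses, which fit full logistic regressions with many variables.

```python
import math


def odds_ratio(a_yes, a_no, b_yes, b_no):
    # Odds are "how many times it happened" over "how many times it didn't".
    odds_a = a_yes / a_no
    odds_b = b_yes / b_no
    return odds_a / odds_b


# Hypothetical counts: loan denials among 100 applicants in each of two groups.
group_a_denied, group_a_approved = 30, 70
group_b_denied, group_b_approved = 10, 90

ratio = odds_ratio(group_a_denied, group_a_approved,
                   group_b_denied, group_b_approved)

# In a logistic regression with a single yes/no predictor, the fitted
# coefficient is the log of this odds ratio, so exponentiating it recovers
# the ratio.
coefficient = math.log(ratio)

print(f"Odds ratio: {ratio:.2f}")
print(f"Equivalent logistic regression coefficient: {coefficient:.2f}")
```

An odds ratio is not the same as "three times as likely": in this made-up example group A is denied three times as often (30% vs 10%), but the odds ratio is closer to four. That gap is worth keeping in mind when writing up results.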

This doesn't include all of the pieces! Make sure to poke around to see what your other options are.
