
About investigate.ai

I can't think of anything useful or clever to put here, so let me just say this cookie recipe is really good.

Hi, I'm Soma!

For the past four or five years I've run the Lede Program, a non-degree summer program at Columbia's Journalism School that teaches journalists all sorts of data-y stuff. I'm also involved with the new Data Journalism MS program.

They're both short programs, so we can't always cover all of the topics I'd like. Investigate.ai is me taking some of the fancier pieces, fleshing them out, and putting them on the internet to be cruelly ridiculed.

It was originally called "Data Science for Journalism," and while ds4j was incredibly cute, the URL investigate.ai was just cheesy enough to be infinitely memorable.

I have found a terrible, terrible error (or a much, much better way to do one of these things)

Thanks for letting me know, you can find me at jonathan.soma@gmail.com or @dangerscarf.

How should I learn Python?

I wrote a tutorial that I personally enjoy, although it's more of the basic basics. After that you'll want to get into data analysis with the pandas library. Some people like First Python Notebook, so might as well give it a shot. If you'd like to burn a whole summer on this stuff you can always come hang out at Columbia.
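To give a taste of what pandas-style data analysis looks like once you're past the basics, here's a tiny sketch. The dataset is invented for illustration, not from any of the tutorials mentioned above.

```python
import pandas as pd

# A tiny hypothetical dataset: made-up city budgets
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Chicago", "Denver"],
    "population": [950_000, 690_000, 2_700_000, 715_000],
    "budget_millions": [4_200, 3_600, 11_000, 2_900],
})

# Spending per resident, sorted highest first
df["per_capita"] = df["budget_millions"] * 1_000_000 / df["population"]
print(df.sort_values("per_capita", ascending=False))
```

A few lines like that — load, compute a new column, sort — is most of day-to-day data journalism, which is why pandas is the recommended next step.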

Machine learning isn't AI! And thus, investigate.ai is a terrible name.


There's this whole issue with .ml domains so let's just accept that we now live in an .ai world. But if you'd prefer a hot take: machine learning and data science are statistics with a pay raise, and artificial intelligence is what the marketing department calls anything involving a computer.

(and at this moment I'm a one-man marketing department)

Where'd this all come from?

The Lede Program has had a class called Algorithms for the past few years, where we talk about how to perform machine learning as well as how algorithms make decisions all around us (e.g. filter bubbles, predictive policing, etc). It's been taught a lot of different ways by a handful of different instructors, and this is me building content around an approach that I think works well.

What's the approach?

Machine learning as a concept is seductive, and the best way to demystify it is to learn it and practice it. Instead of building up from math and first concepts, we put the tools to use building projects and try to talk a lot about what could go wrong. Journalists like doing stuff, so we give them stuff to do.

Aspirationally, the students learn:

  • Machine learning is not magic
  • It's uncomfortably easy to do machine learning
  • Machine learning often and easily produces garbage
  • "But the computer did it" is not an excuse

And that despite all this, machine learning can amazingly, actually, sometimes be useful.

Go on, please rant a little

I've tried to structure this material in a way that's hopefully a bit different than most machine learning courses. I could spend all the time in the world using words like eigenvalues and false negatives and F1 scores, but (I think, mostly) that... just isn't the important stuff to me, and it can be kind of alienating.

I think the real decisions you make when using machine learning happen when you sit down and think about what your algorithm is doing and what happens when it's put into use. That helps you figure out the difference between an outcome of "an intern has to read more documents" and "more people go to prison." Coming at it from this practical angle provides, in my opinion, a more responsible education than memorizing terms like "false positives."

For example, The Elements of AI Course is delightful, but the "Implications" module gets thrown to the end. Same thing with the otherwise-wonderful fast.ai course - "Ethics" is just one part of the second-to-last lesson.

It just seems like this stuff could be baked in earlier, and not necessarily separately. I get that "what are the implications of incorrectly identifying an Iris setosa as an Iris versicolor?" isn't the easiest thing in the world, but maybe the traditional datasets could use a refresh, too.

(The 1935 paper the iris dataset is from has wicked visuals, though. And I don't think it's impossible to discuss what an ethical and responsible approach to classifying irises is, I just haven't done it yet!)

You seem somewhat dismissive of math

Math is great! It does intimidate some people, though.

Knowing how Bayes Theorem or linear algebra works is excellent for some aspects of optimization and determining the weaknesses of this or that approach or algorithm, but not knowing the details shouldn't stop you from building a tool to help search through 130,000 app store reviews.

Understanding the possible side effects of your tools is more important than knowing how the tools work. Even experts are terrible at that, even though the effects (including side effects) are the reason we use machine learning in the first place.

And I mean, let's be honest, I don't know if general adherence to the Math Gods has worked out so well.

Why don't you talk about TensorFlow, PyTorch, BERT, etc.?

I'd like to, actually, especially since the Google Colab notebooks make it all super easy. I'm a big fan of people being able to run things on their own machines, though, so I figured we'd start with sklearn and build the rest on later.
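To show why sklearn is the friendly starting point, here's a minimal sketch of the kind of text classifier the courses build. The reviews and labels are invented purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training examples: complaints vs. praise in app reviews
reviews = [
    "this app crashes constantly and loses my data",
    "terrible update, nothing works anymore",
    "love the new design, works perfectly",
    "great app, use it every day",
]
labels = ["complaint", "complaint", "praise", "praise"]

# Turn text into TF-IDF features, then feed them to a simple classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)

print(model.predict(["this app crashes and loses data"]))
```

That's the whole thing — uncomfortably easy, which is exactly the point. Whether the predictions are any good (and what happens when they're wrong) is where the actual work lives.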

I am very unhappy with your constant use of bold

I'm easily distracted and I find it an easier read! Helps me focus. Kinda like this.

Credits and thank-yous

Big thanks to Columbia's J-School and The Knight Foundation, who are the reasons this is a nice $free website.

Infinite appreciation to the journalists who originally published the pieces reproduced here, especially those who took the time to chat with me, including Aaron Glantz, Ben Poston, Emmanuel Martinez, Jeff Ernsthausen, Jennifer LaFleur, John Keefe, Nathaniel Lash, Nicky Forster, Peter Aldhous, Reade Levinson, Ryan McNeill, and Will Craft.

A thousand thank-yous to the instructors for The Lede Program's Algorithms course, for developing new content every year and making continual adjustments in the name of Making A Good Class: Priyanjana Bengani, Chase Davis, Elizabeth Dollar, Richard Dunks, Chris Wiggins, and Jonathan Stray.

And all my students from Lede and the Data MS, whose summers of suffering through learning Python made this all possible! Get ready for a very long list to show up here.