Comparing classifiers

About the site

Hi, I'm Soma, welcome to Data Science for Journalism a.k.a. investigate.ai!

There's been a lot of buzz about machine learning and "artificial intelligence" being used in stories over the past few years. It's mostly not that complicated - a little stats, a classifier here or there - but it's hard to know where to start without a little help.

If you know a little Python programming, hopefully this site can be that help!

Links

  • jonathan.soma@gmail.com
  • @dangerscarf
  • Images via icons8

Thanks to Columbia Journalism School, the Knight Foundation, and many others.
