Searching for faulty airbags in vehicle complaints from The New York Times

The National Highway Traffic Safety Administration (NHTSA) receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?

classification, text analysis, NHTSA, logistic regression, natural language processing, feature selection, reading lots of documents

Readings and links

Summary

The New York Times wanted to track down complaints related to a dangerous Takata airbag recall using a large dataset of consumer complaints from the National Highway Traffic Safety Administration. With the dataset at well over a gigabyte, though, it was a slow and laborious process.

While one reporter combed through datasets manually, a member of the business-side data science team leveraged machine learning to try to track down suspicious complaints.

This is not an accurate reproduction of the project, but a series of fairly simple exercises meant to teach the basic concepts of text analysis and classification, such as vectorization, comparing classifiers, and understanding evaluation metrics. It also highlights the question of when machine learning might be helpful, as compared to just doing simple searches with manual filtering.
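As a rough illustration of why evaluating a classifier takes more than one number, here is a minimal sketch that tallies a confusion matrix by hand. The labels are invented for illustration (1 = suspicious, 0 = not), and the helper names `confusion_counts` and `precision_recall` are hypothetical, not part of the project.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Tally true/false positives and negatives for binary labels (1 = suspicious)."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 1 and p == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

def precision_recall(counts):
    """Precision: share of flagged complaints that were suspicious.
    Recall: share of suspicious complaints that were flagged."""
    flagged = counts["tp"] + counts["fp"]
    truly_suspicious = counts["tp"] + counts["fn"]
    precision = counts["tp"] / flagged if flagged else 0.0
    recall = counts["tp"] / truly_suspicious if truly_suspicious else 0.0
    return precision, recall

# Invented labels: 2 suspicious complaints out of 10
actual = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
# A lazy "classifier" that never flags anything
predicted = [0] * 10

counts = confusion_counts(actual, predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)
precision, recall = precision_recall(counts)
print(f"accuracy={accuracy}, precision={precision}, recall={recall}")
```

The lazy classifier scores 80% accuracy while finding none of the suspicious complaints, which is why recall matters when the interesting class is rare.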

Spoiler alert: the whole point is to not get good results, no matter how many data science concepts you throw at the problem. In the end you should just read and label more complaints, or (more honestly) just build a complaint search engine for the reporter to use.
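For comparison, the "simple search" alternative can be sketched in a few lines. This is an assumption about what such a tool might look like, not the reporter's actual workflow; `search_complaints` and the sample complaints are hypothetical.

```python
import re

def search_complaints(complaints, required, optional=()):
    """Return complaints containing every required term, ranked by optional-term hits."""
    results = []
    for text in complaints:
        # Whole-word matching on lowercase letters only
        words = set(re.findall(r"[a-z]+", text.lower()))
        if all(term in words for term in required):
            score = sum(term in words for term in optional)
            results.append((score, text))
    # Highest score first; the sort is stable, so ties keep their original order
    results.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in results]

# Invented complaints for illustration
complaints = [
    "Airbag exploded, metal shrapnel hit driver",
    "Airbag light stays on",
    "Brake pedal feels soft",
]

hits = search_complaints(
    complaints,
    required=["airbag"],
    optional=["shrapnel", "metal", "exploded"],
)
for hit in hits:
    print(hit)
```

A human still reads every hit, but the ranking puts the most alarming complaints at the top of the pile.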

Notebooks, Assignments, and Walkthroughs

Complete walkthrough

A walkthrough of building your very first classifier. It won't work very well, but you'll learn the basics.

Read online

Multi-page walkthrough

A simplistic reproduction of the NYT's research using logistic regression

Using a handcrafted set of words, we build a logistic regression classifier to track down suspicious airbag-related consumer complaints.
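A minimal sketch of that idea, assuming scikit-learn is installed. The keyword list and the complaints are invented for illustration; the words used in the actual walkthrough are not given here, and `keyword_features` is a hypothetical helper.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical handcrafted keyword list (an assumption, not the project's list)
KEYWORDS = ["airbag", "explode", "shrapnel", "metal"]

def keyword_features(text):
    """Return one 0/1 feature per keyword: does the complaint contain it?

    Substring matching is used, so "explode" also matches "exploded".
    """
    lowered = text.lower()
    return [int(word in lowered) for word in KEYWORDS]

# Invented, hand-labeled complaints (1 = suspicious, 0 = not)
complaints = [
    "Airbag exploded and sent metal shrapnel into the cabin",
    "The airbag deployed and metal fragments cut the driver",
    "Brakes squeal when stopping at low speed",
    "Transmission slips between second and third gear",
    "Airbag warning light stays on after dealer visit",
    "Paint is peeling from the hood",
]
labels = [1, 1, 0, 0, 0, 0]

X = [keyword_features(text) for text in complaints]
clf = LogisticRegression()
clf.fit(X, labels)

# One coefficient per keyword: positive values push toward "suspicious"
for word, coef in zip(KEYWORDS, clf.coef_[0]):
    print(f"{word}: {coef:.2f}")

# Score an unseen complaint; column 1 is the probability of the "suspicious" class
new_features = keyword_features("Airbag ruptured, metal pieces hit passenger")
probability = clf.predict_proba([new_features])[0][1]
print(f"probability suspicious: {probability:.2f}")
```

With only six training rows the coefficients mean very little; the point is the shape of the workflow, from words to features to a fitted model.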

Read online

Download notebook

Interactive version

A decision-tree reproduction of the NYT's research

Upgrading the previous attempt, we examine whether using decision trees and random forests leads to better results.
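A minimal sketch of swapping in tree-based models, assuming scikit-learn is installed. The feature matrix is invented for illustration, and the hyperparameters (`max_depth=2`, `n_estimators=50`) are arbitrary choices, not values taken from the walkthrough.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Invented 0/1 keyword features, one row per complaint
feature_names = ["airbag", "explode", "shrapnel", "brake"]
X = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
# Invented labels (1 = suspicious, 0 = not)
y = [1, 1, 1, 0, 0, 0, 0, 0]

# A single shallow tree, and a forest that averages many randomized trees
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Compare which features each model leaned on
for name, tree_imp, forest_imp in zip(
    feature_names, tree.feature_importances_, forest.feature_importances_
):
    print(f"{name}: tree={tree_imp:.2f} forest={forest_imp:.2f}")

tree_predictions = tree.predict(X)
forest_predictions = forest.predict(X)
```

Predicting on the training rows, as done here, says nothing about real performance; a fair comparison would score both models on held-out complaints.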

Read online

Download notebook

Interactive version

Combining a text vectorizer and a classifier to track down suspicious complaints

Another attempt to build a classifier to track down suspicious airbag-related consumer complaints. This time we'll use a vectorizer to automatically count the words in a document.
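A minimal sketch of letting a vectorizer pick the words, assuming scikit-learn is installed. The complaints and labels are invented, and pairing `CountVectorizer` with `LogisticRegression` is one plausible combination rather than a confirmed detail of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented, hand-labeled complaints (1 = suspicious, 0 = not)
complaints = [
    "Airbag exploded and sent metal shrapnel into the cabin",
    "The airbag deployed and metal fragments cut the driver",
    "Brakes squeal when stopping at low speed",
    "Transmission slips between second and third gear",
    "Airbag warning light stays on after dealer visit",
    "Paint is peeling from the hood",
]
labels = [1, 1, 0, 0, 0, 0]

# Step 1: the vectorizer builds its own vocabulary and counts each word per document
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(complaints)
print(sorted(vectorizer.vocabulary_))  # every word that became a column
print(counts.shape)                    # (number of complaints, number of words)

# Step 2: chain a vectorizer and a classifier so raw text goes in, predictions come out
model = make_pipeline(CountVectorizer(stop_words="english"), LogisticRegression())
model.fit(complaints, labels)

# Column 1 is the probability of the "suspicious" class
probability = model.predict_proba(["Airbag sprayed metal pieces at the passenger"])[0][1]
print(f"probability suspicious: {probability:.2f}")
```

Unlike a handcrafted word list, every word in the training complaints becomes a feature here, which is convenient but also lets irrelevant words influence the model.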

Read online

Download notebook

Interactive version
