Searching for faulty airbags in vehicle complaints

The National Highway Traffic Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to pick out leads on Takata airbag malfunctions?

classification, text analysis, NHTSA, logistic regression, natural language processing, feature selection, reading lots of documents

Summary

The New York Times wanted to track down complaints related to a dangerous Takata airbag recall using a large dataset of consumer complaints from the National Highway Traffic Safety Administration. Because the dataset is well over a gigabyte, though, going through it was a slow and laborious process.

While one reporter combed through datasets manually, a member of the business-side data science team leveraged machine learning to try to track down suspicious complaints.

This is not really an accurate reproduction of the project, but rather a series of fairly simple exercises meant to teach the basic concepts of text analysis and classification, such as vectorization, comparing classifiers, and understanding evaluation metrics. It also highlights the question of when machine learning might be helpful, as compared to just doing simple searches with manual filtering.

Spoiler alert: the whole point is to not get good results, no matter how many data science concepts you throw at the problem. In the end you should just read and label more complaints, or (more honestly) just build a complaint search engine for the reporter to use.

Notebooks, Assignments, and Walkthroughs

Complete walkthrough

A walkthrough of building your very first classifier. It won't work very well, but you'll learn the basics.

Multi-page walkthrough

A simplistic reproduction of the NYT's research using logistic regression

Using a handcrafted set of words, we build a logistic regression classifier to track down suspicious airbag-related consumer complaints.
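To give a feel for the approach, here is a minimal sketch using pandas and scikit-learn. The complaints, the labels, and the word list are all made up for illustration and are not taken from the NHTSA data or from the NYT's work. Each handcrafted word becomes a 0/1 column, and the logistic regression learns a weight for each one.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical, made-up complaints (assumption: not real NHTSA records).
# is_suspicious: 1 = looks like the airbag defect, 0 = does not.
complaints = pd.DataFrame({
    "text": [
        "AIRBAG DEPLOYED VIOLENTLY AND SENT METAL SHRAPNEL INTO THE CABIN",
        "AIRBAG EXPLODED DURING A LOW SPEED CRASH, METAL FRAGMENTS CUT THE DRIVER",
        "AIRBAG WARNING LIGHT STAYS ON AFTER STARTING THE CAR",
        "AIRBAG DID NOT DEPLOY IN A FRONTAL COLLISION",
        "PASSENGER AIRBAG RUPTURED AND SHRAPNEL HIT THE WINDSHIELD",
        "AIRBAG LIGHT FLASHES WHEN THE SEAT IS ADJUSTED",
    ],
    "is_suspicious": [1, 1, 0, 0, 1, 0],
})

# Handcrafted word list (an assumption chosen for this example).
words = ["shrapnel", "metal", "exploded", "ruptured", "light"]

# One 0/1 column per word: does the complaint mention it at all?
features = pd.DataFrame({
    word: complaints["text"].str.contains(word, case=False).astype(int)
    for word in words
})

# Fit the classifier on the word columns.
clf = LogisticRegression(max_iter=1000)
clf.fit(features, complaints["is_suspicious"])

# Positive weights push toward "suspicious", negative weights push away.
coefs = pd.Series(clf.coef_[0], index=words).sort_values(ascending=False)
print(coefs)
```

With only six invented rows the weights mean very little, but the shape of the workflow is the same on real data: pick words, build columns, fit, then inspect the coefficients.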

A decision-tree reproduction of the NYT's research

Upgrading the previous attempt, we examine whether using decision trees and random forests leads to an improvement in results.
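A rough sketch of that comparison, assuming scikit-learn and a tiny made-up table of 0/1 word features (the column names and labels are hypothetical, not from the real complaints). Instead of coefficients, tree-based models report feature importances.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical 0/1 word features: did the complaint mention each word?
features = pd.DataFrame({
    "shrapnel": [1, 0, 0, 0, 1, 0, 1, 0],
    "exploded": [0, 1, 0, 0, 0, 0, 1, 0],
    "light":    [0, 0, 1, 0, 0, 1, 0, 1],
})
# Made-up labels: 1 = suspicious, 0 = not suspicious.
labels = [1, 1, 0, 0, 1, 0, 1, 0]

models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same features and record what it found important.
importances_by_model = {}
train_accuracy = {}
for name, model in models.items():
    model.fit(features, labels)
    importances_by_model[name] = pd.Series(
        model.feature_importances_, index=features.columns
    )
    # Accuracy on the training rows only; a real comparison needs held-out data.
    train_accuracy[name] = model.score(features, labels)

for name, importances in importances_by_model.items():
    print(name)
    print(importances.sort_values(ascending=False))
```

Scoring on the training rows flatters every model, so on real data you would split off a test set before deciding whether the trees are actually an improvement.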

Combining a text vectorizer and a classifier to track down suspicious complaints

Another attempt to build a classifier to track down suspicious airbag-related consumer complaints. This time we'll use a vectorizer to automatically count the words in a document.
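As a minimal sketch of the idea, assuming scikit-learn: `CountVectorizer` turns each complaint into word counts, and a pipeline feeds those counts straight into a logistic regression. The complaint texts and labels below are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, made-up complaints (assumption: not real NHTSA records).
texts = [
    "AIRBAG DEPLOYED AND SENT METAL SHRAPNEL INTO THE CABIN",
    "AIRBAG EXPLODED IN A MINOR CRASH AND METAL FRAGMENTS CUT THE DRIVER",
    "AIRBAG WARNING LIGHT STAYS ON AFTER STARTING THE CAR",
    "AIRBAG DID NOT DEPLOY IN A FRONTAL COLLISION",
    "PASSENGER AIRBAG RUPTURED AND SHRAPNEL HIT THE WINDSHIELD",
    "AIRBAG LIGHT FLASHES WHEN THE SEAT IS ADJUSTED",
]
# Made-up labels: 1 = suspicious, 0 = not suspicious.
labels = [1, 1, 0, 0, 1, 0]

# The vectorizer lowercases the text, drops common English stop words,
# and counts every remaining word. No handcrafted word list needed.
pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(texts, labels)

# Look at the vocabulary the vectorizer discovered, in column order.
vectorizer = pipeline.named_steps["countvectorizer"]
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)

# Score a new, unseen (also made-up) complaint.
new_complaint = ["DRIVER AIRBAG EXPLODED AND SHRAPNEL INJURED THE PASSENGER"]
probabilities = pipeline.predict_proba(new_complaint)
print(probabilities)  # columns follow pipeline.classes_: [0, 1]
```

The trade-off is that the vectorizer happily creates a column for every word it sees, so on real complaints you end up with thousands of features and far fewer labeled examples than you would like.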