Searching for faulty airbags in vehicle complaints
The National Highway Traffic Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to pick out leads on Takata airbag malfunctions?
classification, text analysis, NHTSA, logistic regression, natural language processing, feature selection, reading lots of documents
Readings and links
The New York Times wanted to track down complaints related to a dangerous Takata airbag recall using a large dataset of consumer complaints from the National Highway Traffic Safety Administration. Because the dataset ran to well over a gigabyte, though, searching it was a slow and laborious process.
While one reporter combed through datasets manually, a member of the business-side data science team leveraged machine learning to try to track down suspicious complaints.
This is not really an accurate reproduction of the project, but a series of fairly simple exercises meant to teach the basic concepts of text analysis and classification, such as vectorization, comparing classifiers, and understanding evaluation metrics. It also highlights the question of when machine learning might be helpful, as compared to just doing simple searches with manual filtering.
Spoiler alert: the whole point is to not get good results, no matter how many data science concepts you throw at the problem. In the end you should just read and label more complaints, or (more honestly) just build a complaint search engine for the reporter to use.
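To make the "simple searches with manual filtering" alternative concrete, here is a minimal sketch of a keyword search over complaint text. The function name search_complaints, the keyword list, and the sample complaints are all invented for illustration; they are not taken from the NHTSA dataset or from the original project.

```python
def search_complaints(complaints, keywords):
    """Return the complaints that mention at least one keyword (case-insensitive)."""
    lowered_keywords = [keyword.lower() for keyword in keywords]
    return [
        text
        for text in complaints
        if any(keyword in text.lower() for keyword in lowered_keywords)
    ]


# Invented sample complaints, for illustration only.
complaints = [
    "AIRBAG DEPLOYED AND METAL FRAGMENTS STRUCK THE DRIVER",
    "Brakes squeal loudly when stopping",
    "The air bag warning light stays on",
    "Transmission slips when shifting gears",
]

# Hypothetical keyword list; a reporter would refine it while reading results.
hits = search_complaints(complaints, ["airbag", "air bag"])
for text in hits:
    print(text)
```

A search like this only narrows the pile. The manual filtering step, where someone reads each hit and decides whether it is actually suspicious, still has to happen afterwards.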
Notebooks, Assignments, and Walkthroughs
A walkthrough of building your very first classifier. It won't work very well, but you'll learn the basics.
Using a handcrafted set of words, we build a logistic regression classifier to track down suspicious airbag-related consumer complaints.
An upgrade of the previous attempt: we examine whether using decision trees and random forests leads to an improvement in results.
Another attempt to build a classifier to track down suspicious airbag-related consumer complaints. This time we'll use a vectorizer to automatically count the words in a document.
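The approaches described above can be sketched in a few lines of scikit-learn. This is only a rough illustration under stated assumptions, not the notebooks' actual code: the complaints, labels, and handcrafted word list are invented, and a real attempt would need far more labeled examples plus a held-out test set.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy complaints; 1 = suspicious airbag complaint, 0 = not suspicious.
complaints = [
    "Airbag deployed violently and metal shrapnel hit the driver",
    "Airbag exploded during a minor crash and sprayed metal fragments",
    "Passenger airbag ruptured and shrapnel cut my face",
    "Brakes squeal loudly when stopping at low speed",
    "Transmission slips when shifting into third gear",
    "Airbag warning light stays on after the dealer visit",
]
labels = [1, 1, 1, 0, 0, 0]

# Approach 1: count only a handcrafted list of words (hypothetical choices).
handcrafted_words = ["airbag", "shrapnel", "metal", "exploded", "ruptured"]
handcrafted_vectorizer = CountVectorizer(vocabulary=handcrafted_words)
X_handcrafted = handcrafted_vectorizer.fit_transform(complaints)

# Approach 2: let the vectorizer count every word it finds in the complaints.
full_vectorizer = CountVectorizer()
X_full = full_vectorizer.fit_transform(complaints)

# Logistic regression on the handcrafted counts, random forest on the full counts.
logreg = LogisticRegression().fit(X_handcrafted, labels)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_full, labels)

# Score two new invented complaints with each classifier.
new_complaints = [
    "Airbag exploded and metal shrapnel injured the passenger",
    "Engine stalls at low speed",
]
logreg_predictions = logreg.predict(handcrafted_vectorizer.transform(new_complaints))
forest_predictions = forest.predict(full_vectorizer.transform(new_complaints))
print(logreg_predictions, forest_predictions)
```

Each row of X_handcrafted holds one count per handcrafted word, while X_full has one column for every word the vectorizer saw. With only six training examples the predictions mean very little, which is in keeping with the spoiler above.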