3.3 Data: Suspicious airbag issues

Our end goal is going to be help a computer understand what a suspicious airbag comment looks like. In order to teach our computer what a suspicious airbag comment looks like, we need to read through a few ourselves first. We’ll mark which ones are suspicious and which aren’t suspicious, which will give the computer an idea of which complaints belong in which category.

We’ll need to do this low-tech style: we’ll open up the complaints database, take out some complaints (some about airbags, some not) and manually read them. We’ll do this using pandas, a Python data analysis library, but you can use most any technique to pull out a handful.

3.3.1 Extracting a sample of complaints

There are many, many columns in this dataset, so we’re going to ask pandas to display lots and lots of columns. The description column can be pretty long, so we’ll also tell pandas to display up to 100 characters in a single column.

import pandas as pd

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)

Now we’ll open FLAT_CMPL.txt, our complaints file form the NHSTA. Programming things to notice:

It’s tab-separated instead of comma-separated, so we use sep='\t'
We need encoding=latin-1 because it’s written just for English-language data
Some of the lines are bad, but we’ll just throw them out with error_bad_lines=False
There’s no header row, so we’ve pulled out all of the column names from CMPL.txt
Since here are a lot of codes that might have leading zeroes we’re reading in all columns as strings
We’re only interested in complaints from before 2015

column_names = ['CMPLID', 'ODINO', 'MFR_NAME', 'MAKETXT', 'MODELTXT', 'YEARTXT', 'CRASH', 'FAILDATE', 'FIRE', 'INJURED', 'DEATHS', 'COMPDESC', 'CITY', 'STATE', 'VIN', 'DATEA', 'LDATE', 'MILES', 'OCCURENCES', 'CDESCR', 'CMPL_TYPE', 'POLICE_RPT_YN', 'PURCH_DT', 'ORIG_OWNER_YN', 'ANTI_BRAKES_YN', 'CRUISE_CONT_YN', 'NUM_CYLS', 'DRIVE_TRAIN', 'FUEL_SYS', 'FUEL_TYPE', 'TRANS_TYPE', 'VEH_SPEED', 'DOT', 'TIRE_SIZE', 'LOC_OF_TIRE', 'TIRE_FAIL_TYPE', 'ORIG_EQUIP_YN', 'MANUF_DT', 'SEAT_TYPE', 'RESTRAINT_TYPE', 'DEALER_NAME', 'DEALER_TEL', 'DEALER_CITY', 'DEALER_STATE', 'DEALER_ZIP', 'PROD_TYPE', 'REPAIRED_YN', 'MEDICAL_ATTN', 'VEHICLES_TOWED_YN']

df = pd.read_csv("data/FLAT_CMPL.txt",
                 sep='\t',
                 dtype='str',
                 header=None,
                 error_bad_lines=False,
                 encoding='latin-1',
                 names=column_names)

# We're only interested in pre-2015
df = df[df.DATEA < '2015']

df.head()

	CMPLID	ODINO	MFR_NAME	MAKETXT	MODELTXT	YEARTXT	CRASH	FAILDATE	FIRE	COMPDESC	CITY	STATE	VIN	DATEA	LDATE	MILES	OCCURENCES	CDESCR	CMPL_TYPE	POLICE_RPT_YN	PURCH_DT	ORIG_OWNER_YN	ANTI_BRAKES_YN	CRUISE_CONT_YN	NUM_CYLS	DRIVE_TRAIN	FUEL_SYS	FUEL_TYPE	TRANS_TYPE	VEH_SPEED	DOT	TIRE_SIZE	LOC_OF_TIRE	TIRE_FAIL_TYPE	ORIG_EQUIP_YN	MANUF_DT	SEAT_TYPE	RESTRAINT_TYPE	DEALER_NAME	DEALER_TEL	DEALER_CITY	DEALER_STATE	DEALER_ZIP	PROD_TYPE	REPAIRED_YN	MEDICAL_ATTN	VEHICLES_TOWED_YN
0	1	958173	Ford Motor Company	LINCOLN	TOWN CAR	1994	Y	19941222	N	SERVICE BRAKES, HYDRAULIC:PEDALS AND LINKAGES	HIGH LAND PA	MI	1LNLM82W8RY	19950103	19950103	NaN	1	BRAKE PEDAL PUSH ROD RETAINER WAS NOT PROPERLY INSTALLED, CAUSING BRAKES TO FAIL, RESULTING IN AN ACCIDENT AFTER RECALL REPAIRS (94V-129). *AK	EVOQ	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	V	NaN	NaN	NaN
1	2	958146	General Motors LLC	GMC	SONOMA	1995	NaN	19941215	N	SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS	MOBILE	AL	1GTCS19W3S8	19950103	19950103	NaN	NaN	VEHICLE STALLS AT HIGH SPEED, RESULTING IN LOSS OF STEERING AND BRAKING ABILITY. TT	EVOQ	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	V	NaN	NaN	NaN
2	3	958127	Ford Motor Company	FORD	RANGER	1994	NaN	NaN	N	ENGINE AND ENGINE COOLING:EXHAUST SYSTEM	N. LAUDERDAL	FL	NaN	19950103	19950103	NaN	NaN	EXHAUST SYSTEM FAILS; PLEASE DESCRIBE DETAILS. TT	EVOQ	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	V	NaN	NaN	NaN
3	4	958170	Ford Motor Company	MERCURY	COUGAR	1995	NaN	19950101	N	SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS	CORRAL SPRIN	FL	1MELM62W5SH	19950103	19950103	NaN	1	BRAKING SYSTEM FAILURE WITHOUT ABS BRAKES. TT	EVOQ	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	V	NaN	NaN	NaN
4	5	958149	Nissan North America, Inc.	NISSAN	MAXIMA	1987	NaN	19941223	N	VISIBILITY:SUN ROOF ASSEMBLY	COLUMBUS	OH	JN1HU11P3HX	19950103	19950103	NaN	1	VEHICLES SUN ROOF GLASS FLEW OFF WHILE DRIVING. TT	EVOQ	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	V	NaN	NaN	NaN

We aren’t really concerned with the data itself just now, as we’re only going to take a sample. Again, most will be about airbags, some will not. We’re only going to save the complaint description because that’s all we really care about!


# Take a random sample of rows
sampled = df.sample(50)

# Take a sample involving airbags
airbag_df = df[df.CDESCR.str.contains("AIR ?BAG", na=False)]
airbag_df = airbag_df.sample(300)

# Combine them so we have all sorts for the machine to learn about
sampled = pd.concat([sampled, airbag_df])

# We don't know whether they're suspicious or not
sampled['is_suspicious'] = ''

# We only want a few columns
sampled = sampled[['is_suspicious', 'CDESCR']]

# Save them
sampled.to_csv("data/sampled-unlabeled.csv", index=False)

3.3.2 Labeling our sample of complaints

Now comes the fun part: opening up sampled-unlabeled.csv and filling in labels for each and every one of those hundreds of rows. We’ll use 1 if it’s suspicious, 0 if it isn’t.

For example, this complaint isn’t suspicious because it’s about an air bag not deploying:

DURING AN  ACCIDENT  AIR BAG'S DID NOT DEPLOY.  DEALER HAS BEEN CONTACTED.  *AK

This next one isn’t suspicious either, because it isn’t even about airbags!

DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS.  TT

But if something involving explosions or shrapnel happens, it’s probably worth marking as suspicious:

I WAS DRIVEN IN A SCHOOL ZONE STREET AND THE LIGHTS OF AIRBAG ON AND APROX. 2 MINUTES THE AIR BAGS EXPLODED IN MY FACE, THE DRIVE AND PASSENGERS SIDE, THEN I STOPPED THE JEEP, IT SMELL LIKE SOMETHING IS BURNING AND HOT, I DID NOT SEE FIRE.  *TR

We’ll just go down the left-hand column in Excel, filling them in as 0 or 1. You’ll probably want to turn on Wrap Text in Excel so you don’t have to scroll left and right so much.