3.3 Data: Suspicious airbag issues

Our end goal is going to be help a computer understand what a suspicious airbag comment looks like. In order to teach our computer what a suspicious airbag comment looks like, we need to read through a few ourselves first. We’ll mark which ones are suspicious and which aren’t suspicious, which will give the computer an idea of which complaints belong in which category.

We’ll need to do this low-tech style: we’ll open up the complaints database, take out some complaints (some about airbags, some not) and manually read them. We’ll do this using pandas, a Python data analysis library, but you can use most any technique to pull out a handful.

3.3.1 Extracting a sample of complaints

There are many, many columns in this dataset, so we’re going to ask pandas to display lots and lots of columns. The description column can be pretty long, so we’ll also tell pandas to display up to 100 characters in a single column.

import pandas as pd

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)

Now we’ll open FLAT_CMPL.txt, our complaints file form the NHSTA. Programming things to notice:

  • It’s tab-separated instead of comma-separated, so we use sep='\t'
  • We need encoding=latin-1 because it’s written just for English-language data
  • Some of the lines are bad, but we’ll just throw them out with error_bad_lines=False
  • There’s no header row, so we’ve pulled out all of the column names from CMPL.txt
  • Since here are a lot of codes that might have leading zeroes we’re reading in all columns as strings
  • We’re only interested in complaints from before 2015
column_names = ['CMPLID', 'ODINO', 'MFR_NAME', 'MAKETXT', 'MODELTXT', 'YEARTXT', 'CRASH', 'FAILDATE', 'FIRE', 'INJURED', 'DEATHS', 'COMPDESC', 'CITY', 'STATE', 'VIN', 'DATEA', 'LDATE', 'MILES', 'OCCURENCES', 'CDESCR', 'CMPL_TYPE', 'POLICE_RPT_YN', 'PURCH_DT', 'ORIG_OWNER_YN', 'ANTI_BRAKES_YN', 'CRUISE_CONT_YN', 'NUM_CYLS', 'DRIVE_TRAIN', 'FUEL_SYS', 'FUEL_TYPE', 'TRANS_TYPE', 'VEH_SPEED', 'DOT', 'TIRE_SIZE', 'LOC_OF_TIRE', 'TIRE_FAIL_TYPE', 'ORIG_EQUIP_YN', 'MANUF_DT', 'SEAT_TYPE', 'RESTRAINT_TYPE', 'DEALER_NAME', 'DEALER_TEL', 'DEALER_CITY', 'DEALER_STATE', 'DEALER_ZIP', 'PROD_TYPE', 'REPAIRED_YN', 'MEDICAL_ATTN', 'VEHICLES_TOWED_YN']

df = pd.read_csv("data/FLAT_CMPL.txt",
                 sep='\t',
                 dtype='str',
                 header=None,
                 error_bad_lines=False,
                 encoding='latin-1',
                 names=column_names)

# We're only interested in pre-2015
df = df[df.DATEA < '2015']

df.head()
CMPLID ODINO MFR_NAME MAKETXT MODELTXT YEARTXT CRASH FAILDATE FIRE INJURED DEATHS COMPDESC CITY STATE VIN DATEA LDATE MILES OCCURENCES CDESCR CMPL_TYPE POLICE_RPT_YN PURCH_DT ORIG_OWNER_YN ANTI_BRAKES_YN CRUISE_CONT_YN NUM_CYLS DRIVE_TRAIN FUEL_SYS FUEL_TYPE TRANS_TYPE VEH_SPEED DOT TIRE_SIZE LOC_OF_TIRE TIRE_FAIL_TYPE ORIG_EQUIP_YN MANUF_DT SEAT_TYPE RESTRAINT_TYPE DEALER_NAME DEALER_TEL DEALER_CITY DEALER_STATE DEALER_ZIP PROD_TYPE REPAIRED_YN MEDICAL_ATTN VEHICLES_TOWED_YN
0 1 958173 Ford Motor Company LINCOLN TOWN CAR 1994 Y 19941222 N 0 0 SERVICE BRAKES, HYDRAULIC:PEDALS AND LINKAGES HIGH LAND PA MI 1LNLM82W8RY 19950103 19950103 NaN 1 BRAKE PEDAL PUSH ROD RETAINER WAS NOT PROPERLY INSTALLED, CAUSING BRAKES TO FAIL, RESULTING IN AN ACCIDENT AFTER RECALL REPAIRS (94V-129). *AK EVOQ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN V NaN NaN NaN
1 2 958146 General Motors LLC GMC SONOMA 1995 NaN 19941215 N 0 0 SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS MOBILE AL 1GTCS19W3S8 19950103 19950103 NaN NaN VEHICLE STALLS AT HIGH SPEED, RESULTING IN LOSS OF STEERING AND BRAKING ABILITY. TT EVOQ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN V NaN NaN NaN
2 3 958127 Ford Motor Company FORD RANGER 1994 NaN NaN N 0 0 ENGINE AND ENGINE COOLING:EXHAUST SYSTEM N. LAUDERDAL FL NaN 19950103 19950103 NaN NaN EXHAUST SYSTEM FAILS; PLEASE DESCRIBE DETAILS. TT EVOQ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN V NaN NaN NaN
3 4 958170 Ford Motor Company MERCURY COUGAR 1995 NaN 19950101 N 0 0 SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS CORRAL SPRIN FL 1MELM62W5SH 19950103 19950103 NaN 1 BRAKING SYSTEM FAILURE WITHOUT ABS BRAKES. TT EVOQ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN V NaN NaN NaN
4 5 958149 Nissan North America, Inc.  NISSAN MAXIMA 1987 NaN 19941223 N 0 0 VISIBILITY:SUN ROOF ASSEMBLY COLUMBUS OH JN1HU11P3HX 19950103 19950103 NaN 1 VEHICLES SUN ROOF GLASS FLEW OFF WHILE DRIVING. TT EVOQ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN V NaN NaN NaN

We aren’t really concerned with the data itself just now, as we’re only going to take a sample. Again, most will be about airbags, some will not. We’re only going to save the complaint description because that’s all we really care about!


# Take a random sample of rows
sampled = df.sample(50)

# Take a sample involving airbags
airbag_df = df[df.CDESCR.str.contains("AIR ?BAG", na=False)]
airbag_df = airbag_df.sample(300)

# Combine them so we have all sorts for the machine to learn about
sampled = pd.concat([sampled, airbag_df])

# We don't know whether they're suspicious or not
sampled['is_suspicious'] = ''

# We only want a few columns
sampled = sampled[['is_suspicious', 'CDESCR']]

# Save them
sampled.to_csv("data/sampled-unlabeled.csv", index=False)

3.3.2 Labeling our sample of complaints

Now comes the fun part: opening up sampled-unlabeled.csv and filling in labels for each and every one of those hundreds of rows. We’ll use 1 if it’s suspicious, 0 if it isn’t.

For example, this complaint isn’t suspicious because it’s about an air bag not deploying:

DURING AN  ACCIDENT  AIR BAG'S DID NOT DEPLOY.  DEALER HAS BEEN CONTACTED.  *AK  

This next one isn’t suspicious either, because it isn’t even about airbags!

DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS.  TT

But if something involving explosions or shrapnel happens, it’s probably worth marking as suspicious:

I WAS DRIVEN IN A SCHOOL ZONE STREET AND THE LIGHTS OF AIRBAG ON AND APROX. 2 MINUTES THE AIR BAGS EXPLODED IN MY FACE, THE DRIVE AND PASSENGERS SIDE, THEN I STOPPED THE JEEP, IT SMELL LIKE SOMETHING IS BURNING AND HOT, I DID NOT SEE FIRE.  *TR

We’ll just go down the left-hand column in Excel, filling them in as 0 or 1. You’ll probably want to turn on Wrap Text in Excel so you don’t have to scroll left and right so much.