3.3 Data: Suspicious airbag issues
Our end goal is going to be help a computer understand what a suspicious airbag comment looks like. In order to teach our computer what a suspicious airbag comment looks like, we need to read through a few ourselves first. We’ll mark which ones are suspicious and which aren’t suspicious, which will give the computer an idea of which complaints belong in which category.
We’ll need to do this low-tech style: we’ll open up the complaints database, take out some complaints (some about airbags, some not) and manually read them. We’ll do this using pandas, a Python data analysis library, but you can use most any technique to pull out a handful.
3.3.1 Extracting a sample of complaints
There are many, many columns in this dataset, so we’re going to ask pandas to display lots and lots of columns. The description column can be pretty long, so we’ll also tell pandas to display up to 100 characters in a single column.
import pandas as pd
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)
Now we’ll open FLAT_CMPL.txt
, our complaints file form the NHSTA. Programming things to notice:
- It’s tab-separated instead of comma-separated, so we use
sep='\t'
- We need
encoding=latin-1
because it’s written just for English-language data - Some of the lines are bad, but we’ll just throw them out with
error_bad_lines=False
- There’s no header row, so we’ve pulled out all of the column names from
CMPL.txt
- Since here are a lot of codes that might have leading zeroes we’re reading in all columns as strings
- We’re only interested in complaints from before 2015
column_names = ['CMPLID', 'ODINO', 'MFR_NAME', 'MAKETXT', 'MODELTXT', 'YEARTXT', 'CRASH', 'FAILDATE', 'FIRE', 'INJURED', 'DEATHS', 'COMPDESC', 'CITY', 'STATE', 'VIN', 'DATEA', 'LDATE', 'MILES', 'OCCURENCES', 'CDESCR', 'CMPL_TYPE', 'POLICE_RPT_YN', 'PURCH_DT', 'ORIG_OWNER_YN', 'ANTI_BRAKES_YN', 'CRUISE_CONT_YN', 'NUM_CYLS', 'DRIVE_TRAIN', 'FUEL_SYS', 'FUEL_TYPE', 'TRANS_TYPE', 'VEH_SPEED', 'DOT', 'TIRE_SIZE', 'LOC_OF_TIRE', 'TIRE_FAIL_TYPE', 'ORIG_EQUIP_YN', 'MANUF_DT', 'SEAT_TYPE', 'RESTRAINT_TYPE', 'DEALER_NAME', 'DEALER_TEL', 'DEALER_CITY', 'DEALER_STATE', 'DEALER_ZIP', 'PROD_TYPE', 'REPAIRED_YN', 'MEDICAL_ATTN', 'VEHICLES_TOWED_YN']
df = pd.read_csv("data/FLAT_CMPL.txt",
sep='\t',
dtype='str',
header=None,
error_bad_lines=False,
encoding='latin-1',
names=column_names)
# We're only interested in pre-2015
df = df[df.DATEA < '2015']
df.head()
CMPLID | ODINO | MFR_NAME | MAKETXT | MODELTXT | YEARTXT | CRASH | FAILDATE | FIRE | INJURED | DEATHS | COMPDESC | CITY | STATE | VIN | DATEA | LDATE | MILES | OCCURENCES | CDESCR | CMPL_TYPE | POLICE_RPT_YN | PURCH_DT | ORIG_OWNER_YN | ANTI_BRAKES_YN | CRUISE_CONT_YN | NUM_CYLS | DRIVE_TRAIN | FUEL_SYS | FUEL_TYPE | TRANS_TYPE | VEH_SPEED | DOT | TIRE_SIZE | LOC_OF_TIRE | TIRE_FAIL_TYPE | ORIG_EQUIP_YN | MANUF_DT | SEAT_TYPE | RESTRAINT_TYPE | DEALER_NAME | DEALER_TEL | DEALER_CITY | DEALER_STATE | DEALER_ZIP | PROD_TYPE | REPAIRED_YN | MEDICAL_ATTN | VEHICLES_TOWED_YN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 958173 | Ford Motor Company | LINCOLN | TOWN CAR | 1994 | Y | 19941222 | N | 0 | 0 | SERVICE BRAKES, HYDRAULIC:PEDALS AND LINKAGES | HIGH LAND PA | MI | 1LNLM82W8RY | 19950103 | 19950103 | NaN | 1 | BRAKE PEDAL PUSH ROD RETAINER WAS NOT PROPERLY INSTALLED, CAUSING BRAKES TO FAIL, RESULTING IN AN ACCIDENT AFTER RECALL REPAIRS (94V-129). *AK | EVOQ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | V | NaN | NaN | NaN |
1 | 2 | 958146 | General Motors LLC | GMC | SONOMA | 1995 | NaN | 19941215 | N | 0 | 0 | SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS | MOBILE | AL | 1GTCS19W3S8 | 19950103 | 19950103 | NaN | NaN | VEHICLE STALLS AT HIGH SPEED, RESULTING IN LOSS OF STEERING AND BRAKING ABILITY. TT | EVOQ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | V | NaN | NaN | NaN |
2 | 3 | 958127 | Ford Motor Company | FORD | RANGER | 1994 | NaN | NaN | N | 0 | 0 | ENGINE AND ENGINE COOLING:EXHAUST SYSTEM | N. LAUDERDAL | FL | NaN | 19950103 | 19950103 | NaN | NaN | EXHAUST SYSTEM FAILS; PLEASE DESCRIBE DETAILS. TT | EVOQ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | V | NaN | NaN | NaN |
3 | 4 | 958170 | Ford Motor Company | MERCURY | COUGAR | 1995 | NaN | 19950101 | N | 0 | 0 | SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS | CORRAL SPRIN | FL | 1MELM62W5SH | 19950103 | 19950103 | NaN | 1 | BRAKING SYSTEM FAILURE WITHOUT ABS BRAKES. TT | EVOQ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | V | NaN | NaN | NaN |
4 | 5 | 958149 | Nissan North America, Inc. | NISSAN | MAXIMA | 1987 | NaN | 19941223 | N | 0 | 0 | VISIBILITY:SUN ROOF ASSEMBLY | COLUMBUS | OH | JN1HU11P3HX | 19950103 | 19950103 | NaN | 1 | VEHICLES SUN ROOF GLASS FLEW OFF WHILE DRIVING. TT | EVOQ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | V | NaN | NaN | NaN |
We aren’t really concerned with the data itself just now, as we’re only going to take a sample. Again, most will be about airbags, some will not. We’re only going to save the complaint description because that’s all we really care about!
# Take a random sample of rows
sampled = df.sample(50)
# Take a sample involving airbags
airbag_df = df[df.CDESCR.str.contains("AIR ?BAG", na=False)]
airbag_df = airbag_df.sample(300)
# Combine them so we have all sorts for the machine to learn about
sampled = pd.concat([sampled, airbag_df])
# We don't know whether they're suspicious or not
sampled['is_suspicious'] = ''
# We only want a few columns
sampled = sampled[['is_suspicious', 'CDESCR']]
# Save them
sampled.to_csv("data/sampled-unlabeled.csv", index=False)
3.3.2 Labeling our sample of complaints
Now comes the fun part: opening up sampled-unlabeled.csv
and filling in labels for each and every one of those hundreds of rows. We’ll use 1
if it’s suspicious, 0
if it isn’t.
For example, this complaint isn’t suspicious because it’s about an air bag not deploying:
DURING AN ACCIDENT AIR BAG'S DID NOT DEPLOY. DEALER HAS BEEN CONTACTED. *AK
This next one isn’t suspicious either, because it isn’t even about airbags!
DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT
But if something involving explosions or shrapnel happens, it’s probably worth marking as suspicious:
I WAS DRIVEN IN A SCHOOL ZONE STREET AND THE LIGHTS OF AIRBAG ON AND APROX. 2 MINUTES THE AIR BAGS EXPLODED IN MY FACE, THE DRIVE AND PASSENGERS SIDE, THEN I STOPPED THE JEEP, IT SMELL LIKE SOMETHING IS BURNING AND HOT, I DID NOT SEE FIRE. *TR
We’ll just go down the left-hand column in Excel, filling them in as 0
or 1
. You’ll probably want to turn on Wrap Text in Excel so you don’t have to scroll left and right so much.