4.2 Classification

The kind of problem we’re dealing with here is called a classification problem. That’s because we have two different classes of complaints:

  • Complaints that are suspicious
  • Complaints that are not suspicious

We’re going to take the complaints that we didn’t label and hold them up to the computer. The machine’s job is to classify each new complaint into one of those two categories. Before we put it on the job, though, we need to teach it what each category of complaint looks like.

4.2.1 Training data

Teaching a machine learning algorithm about a dataset is called training.

We train our classifier the same way we trained ourselves: by making it read a bunch of complaints! Because we marked each one as suspicious or not suspicious, the computer is able to learn from the work we did.

Using code, we’ll present each row to the classifier and say, “Hi, please remember that a complaint like this is suspicious” (or not suspicious).

import pandas as pd

labeled_df = pd.read_csv("data/sampled-labeled.csv")

# Some weren't labeled - let's just drop those!
labeled_df = labeled_df.dropna()

labeled_df.head()
is_suspicious CDESCR
0 0 ALTHOUGH I LOVED THE CAR OVERALL AT THE TIME I DECIDED TO OWN, , MY DREAM CAR CADILLAC CTS HAS TURNED INTO MY DREAM NIGHTMARE. CADILLAC CTS 3.6 2008 WHEN I GET ON IT A LITTLE BIT ACCELERATION IT MAKES A SOUND THAT SOUNDS LIKE ALL AIR LEAKS INSIDE THE CAR. THE DEALER WAS REPORTED EVER SINCE MY FIRST EARLY VISIT TO SERVICE CENTER BUT IT’S ONLY DURING MY LAST VISIT TO THE DEALER THEY MENTIONED GM HAS SANCTIONED APPROVAL IE. AT THE TIME THE ODOMETER READS 65000KM? STRANGE! THE DOOR LOCKS ARE TERRIBLE?DESPITE RECTIFYING; TIME AND AGAIN BY YOUR DEALER THE PROBLEM STILL PERSIST TO DATE. SAFETY HAZARD INDEED. THE COMPUTER HAD ERROR AS FOR TYRE LOW AIR PRESSURE VERY BAD ON SAFETY STANDARDS. ON 12TH AUG 2012 WHILE I WAS EN-ROUTE TO THE CADILLAC SERVICE CENTER THE VEHICLE HAD A BREAK DOWN. ON THE DISPLAY SCREEN NEAR ODOMETER IT DISPLAYED ENGINE TEMPERATURE TOO HIGH ?. DUE TO THIS I HAD TO PULL THE CAR TOWARDS SAFETY AND GOT STRANDED IN THE MIDDLE OF THE ROAD UNDER THE HOT SUN. HOW CAN YOU JUSTIFY A CAR WHICH IS REGULARLY MAINTAINED BY YOUR AUTHORIZED AGENCY AT REGULAR PERIODIC INTERVALS TO HAVE SUCH A FATE? I FELT IT WAS GOOD TO HAVE IT SERVICED DURING MY ABSENCE IN TOWN BEFORE THE REGULAR KM INTERVAL; BUT, ONLY TO GET STRANDED ON THE ROAD IN THE HOT SUN. IT IS INDEED, TOO MUCH TO SUFFER AFTER BUYING A CAR OF GM FLAGSHIP BRAND < CADILLAC > AND SUFFER AGONY ON THE ROAD SIDE. NOT TO MENTION THE DEALERSHIPS WARRANTY YOU PROVIDE WITH THE PURCHASE OF THE NEW CAR FROM YOUR AUTHORIZED AGENTS, PROBABLY DON’T FIX ANYTHING WITH THE WARRANTY SERVICE PROGRAM. IF THE JOB WAS RIGHT THEN THE ENGINE SHOULD NOT HAVE OVER HEATED ESPECIALLY WHEN THE CAR IS JUST RUN 71000 KMS. APPROX… *TR
1 0 CONSUMER SHUT SLIDING DOOR WHEN ALL POWER LOCKS ON ALL DOORS LOCKED BY ITSELF, TRAPPING INFANT INSIDE THE VEHICLE. VEHICLE WAS RUNNING AT THE TIME. *AK
2 0 DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT
3 0 TL* THE CONTACT OWNS A 2009 NISSAN ALTIMA. THE CONTACT STATED THAT THE START BUTTON FOR THE IGNITION WOULD NOT START THE VEHICLE. THE STEERING LOCK LIGHT ILLUMINATED ON THE INSTRUMENT PANEL WHEN THE FAILURE OCCURRED. THE VEHICLE WAS TOWED TO THE DEALER WHO STATED THE STEERING LOCK NEEDED TO BE REPLACED. THE DEALER RESPONDED AS IF THIS WAS AN ISOLATED ISSUE. THE VEHICLE WAS REPAIRED. THE FAILURE MILEAGE AND CURRENT MILEAGE WAS 57,915. UPDATED 3/28/13 CN UPDATED 05/10/2013 JS
4 0 THE FRONT MIDDLE SEAT DOESN’T LOCK IN PLACE. *AK
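
A note on the dropna() call above: with no arguments, it removes any row that has a missing value in any column, not only rows with a missing label. The sketch below uses a made-up three-row dataframe (toy_df is a hypothetical stand-in, not the real file) to show the difference between that and dropping only on the is_suspicious column.

```python
import pandas as pd

# Made-up rows standing in for labeled_df (not real complaints)
toy_df = pd.DataFrame({
    "is_suspicious": [1.0, None, 0.0],
    "CDESCR": ["AIRBAG EXPLODED", "BRAKES FAILED", None],
})

# Plain dropna(): a row is dropped if ANY column is missing
dropped_any = toy_df.dropna()

# subset=...: a row is dropped only if the label is missing
dropped_label = toy_df.dropna(subset=["is_suspicious"])

print(len(dropped_any), len(dropped_label))
```

If the labeled file only has the two columns shown in the output above, the two approaches should mostly agree; the subset version just makes the intent ("drop the unlabeled ones") explicit.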

Remember how we picked a list of features, or words for our algorithm to pay attention to? Let’s now make a dataframe to see which rows have which words.

To use the list of words, we’re just going to make a new dataframe where there’s a 1 if the word is in the description and a 0 if it isn’t.

.str.contains gives us False or True, and making it an integer with .astype(int) will turn it into 0 or 1 (machine learning gets grumpy about anything that isn’t numbers). Along with the words, we’ll also save the is_suspicious label to keep everything in the same place.
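
To see that two-step conversion on its own, here is a small sketch using three made-up descriptions (they are illustrative, not rows from the complaints file). It shows the True/False result first, then the 0/1 version.

```python
import pandas as pd

# Made-up descriptions, not rows from the real complaints file
descriptions = pd.Series([
    "THE AIRBAG DID NOT DEPLOY",
    "THE RADIO STOPPED WORKING",
    None,  # a missing description
])

# Step 1: True/False for "does the text contain this word?"
# na=False means a missing description counts as "no"
has_airbag = descriptions.str.contains("AIRBAG", na=False)
print(has_airbag.tolist())

# Step 2: turn True/False into 1/0
airbag_feature = has_airbag.astype(int)
print(airbag_feature.tolist())
```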

training_features = pd.DataFrame({
    'is_suspicious': labeled_df.is_suspicious,
    'airbag': labeled_df.CDESCR.str.contains("AIRBAG", na=False).astype(int),
    'air bag': labeled_df.CDESCR.str.contains("AIR BAG", na=False).astype(int),
    'failed': labeled_df.CDESCR.str.contains("FAILED", na=False).astype(int),
    'did not deploy': labeled_df.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
    'violent': labeled_df.CDESCR.str.contains("VIOLENT", na=False).astype(int),
    'explode': labeled_df.CDESCR.str.contains("EXPLODE", na=False).astype(int),
    'shrapnel': labeled_df.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
training_features.head()
   is_suspicious  airbag  air bag  failed  did not deploy  violent  explode  shrapnel
0              0       0        0       0               0        0        0         0
1              0       0        0       0               0        0        0         0
2              0       0        0       0               0        0        0         0
3              0       0        0       0               0        0        0         0
4              0       0        0       0               0        0        0         0
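
The feature-building code above repeats the same line once per word. As the word list grows, a loop can be easier to maintain. This is only a sketch of one alternative: toy_df and its two rows are made up, and regex=False is an added choice so that each word is matched as literal text rather than as a regular expression.

```python
import pandas as pd

# Made-up stand-in for labeled_df
toy_df = pd.DataFrame({
    "is_suspicious": [1, 0],
    "CDESCR": ["AIR BAG EXPLODED AND SENT SHRAPNEL", "SEAT DOES NOT LOCK"],
})

# Same words as the hand-written version above
words = ["AIRBAG", "AIR BAG", "FAILED", "DID NOT DEPLOY",
         "VIOLENT", "EXPLODE", "SHRAPNEL"]

# Start with the label, then add one 0/1 column per word
features = {"is_suspicious": toy_df.is_suspicious}
for word in words:
    features[word.lower()] = (
        toy_df.CDESCR.str.contains(word, na=False, regex=False).astype(int)
    )

toy_features = pd.DataFrame(features)
print(toy_features)
```

Note that "EXPLODED" still counts as a match for "EXPLODE", because str.contains looks for the text anywhere inside the description.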

Not all of the complaints we have are suspicious, which is what we want, since the classifier needs examples of both! How many suspicious airbag complaints do we have in our dataset?

training_features.is_suspicious.value_counts()
## 0.0    150
## 1.0     15
## Name: is_suspicious, dtype: int64

Okay, only 15 of our 165 labeled complaints are suspicious, so maybe we need to do a better job getting more suspicious ones in there. We’ll try to cope with it for now, though.
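
To put a number on how lopsided the labels are, value_counts can report shares instead of raw counts. The sketch below rebuilds only the label column from the counts printed above (150 zeros and 15 ones); labels is a hypothetical stand-in for training_features.is_suspicious.

```python
import pandas as pd

# Rebuild the label column from the counts shown above: 150 zeros, 15 ones
labels = pd.Series([0.0] * 150 + [1.0] * 15, name="is_suspicious")

# normalize=True gives each class as a fraction of all rows
shares = labels.value_counts(normalize=True)
suspicious_share = shares.loc[1.0]

print(round(suspicious_share, 3))
```

That works out to roughly 9% suspicious, which is worth keeping in mind when judging how well the classifier does later.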