4.3 Training our algorithm

Now that we’ve decided on our algorithm, we can take our classifier and teach it about our documents. This is called training.

There’s one problem, though! Just because we built a classifier doesn’t mean it’s going to be any good. And it sure would be rude to put a classifier out into the world that doesn’t quite work, right?

Before we start using our classifier in the world, we need some way of judging its performance. This is called testing.

Testing in machine learning works a lot like testing in a normal classroom.

  1. You have a big test coming up.
  2. Your teacher gives you some sample problems or maybe previous tests to study, all of which have the correct answers included.
  3. On test day, she gives you new problems, ones you’ve never seen before. They’re similar to the sample problems you studied, though.
  4. Unlike you, your teacher knows the answers, so when you hand in your test she can judge how well you performed.

Note that the study problems and the test problems aren’t the same problems. If they were, you could just memorize the answers! Instead, your instructor gave you enough sample problems to study that you have a pretty good idea of what the test problems will be like.
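To make the memorization problem concrete, here is a minimal sketch using made-up data rather than our FOIA documents. The dataset from `make_classification`, the names `memorizer`, `X_study` and `X_exam`, and the choice of a 1-neighbor classifier are all assumptions for illustration. A 1-neighbor classifier simply looks up the closest example it has already seen, so it is a fair stand-in for a student who memorizes the answer key.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the FOIA dataset).
# flip_y randomly reassigns some labels so the data is noisy.
X, y = make_classification(n_samples=400, n_features=5, flip_y=0.2, random_state=0)

# The first 300 rows are the "study problems", the last 100 are the "test problems"
X_study, y_study = X[:300], y[:300]
X_exam, y_exam = X[300:], y[300:]

# With n_neighbors=1, each studied row's nearest neighbor is itself,
# so the classifier effectively memorizes the study problems
memorizer = KNeighborsClassifier(n_neighbors=1)
memorizer.fit(X_study, y_study)

# Fraction of correct answers on problems it has seen vs. problems it has not
study_score = memorizer.score(X_study, y_study)
exam_score = memorizer.score(X_exam, y_exam)

print("Score on study problems:", study_score)
print("Score on new problems:  ", exam_score)
```

The score on the study problems is perfect because the classifier has seen every one of them, which tells us nothing about how it handles new problems. Only the score on unseen problems is an honest grade.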

If we want to run our machine learning world like a class, the first thing we need to do is split up our documents into ones we’ll use for training and ones we’ll use for testing.

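from sklearn.model_selection import train_test_split

# X holds the feature columns, y holds the label we want to predict
X = features.drop('successful', axis=1)
y = features.successful

# By default, 75% of the rows go to training and 25% are held back for testing
X_train, X_test, y_train, y_test = train_test_split(X, y)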
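To see what the split actually produces, here is a minimal sketch that runs on a small made-up table. The toy `features` DataFrame and its `word_count` and `mentions_email` columns are hypothetical stand-ins for the real data. The `test_size`, `random_state` and `stratify` arguments are optional extras, not something the code above requires: `test_size=0.25` just spells out the default, `random_state` makes the shuffle repeatable, and `stratify=y` keeps the share of successful requests roughly equal in both groups.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real `features` DataFrame (100 fake requests)
features = pd.DataFrame({
    "word_count": range(100),
    "mentions_email": [i % 2 for i in range(100)],
    "successful": [i % 4 == 0 for i in range(100)],
})

X = features.drop("successful", axis=1)
y = features.successful

# test_size=0.25 matches the default; random_state makes the split reproducible;
# stratify=y keeps the proportion of successful requests similar in both groups
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)

# No row appears in both groups, so the test really is made of unseen problems
print(set(X_train.index).isdisjoint(X_test.index))
```

Every row lands in exactly one of the two groups, which is what keeps the classifier from seeing the test questions while it studies.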

After we split them into these two groups, training and testing, we’ll give the training dataset to the algorithm for studying. This training is when it (hopefully) learns the difference between a successful and an unsuccessful FOIA request.

knn.fit(X_train, y_train)
## KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
##                      metric_params=None, n_jobs=None, n_neighbors=20, p=2,
##                      weights='uniform')

Once the classifier is done training, we can test it! We’ll feed it the test data, and see if it thinks each FOIA request succeeded or failed.

y_pred = knn.predict(X_test)

In an ideal world, these predictions would match whether the requests were actually successful or not. Unlike a classroom test, though, you’ll rarely see a machine learning algorithm do a perfect job.
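One simple way to grade the test is to count how often the predictions match the real answers. The sketch below is an illustration under a few assumptions: it uses synthetic data from `make_classification` instead of the FOIA features, it builds its own `knn` with `n_neighbors=20` to mirror the settings printed above, and it picks `accuracy_score` as the grading method, which is only one of several reasonable ways to score a classifier.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the FOIA dataset)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Study the training data, then answer the test problems
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Grade the test: the fraction of predictions that match the true answers
accuracy = accuracy_score(y_test, y_pred)

# The same number computed by hand, to show there is no magic involved
manual = (y_pred == y_test).mean()

print(f"Accuracy: {accuracy:.2f}")
```

An accuracy of 1.0 would mean every prediction was right. Anything lower tells you what share of the test problems the classifier got wrong.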