7 Explaining predictions

It’s good to know how an algorithm works, but it’s also useful to know how it arrived at a particular outcome. For example, why was this request granted and that one not? What can I do to improve my failing FOIA request?

Unfortunately, this is one of the most difficult parts of machine learning, and just how difficult depends on which algorithm you’re using.

A logistic regression is by far the easiest to explain, as you can read its output as “for each one-unit increase in your average sentence length, your odds of acceptance change by a fixed factor.” If it works for your situation, go for it! But that simplicity is also the reason it did so poorly once we removed the high_success_rate_agency column, so the explainability might not be worth the massive drop in quality that logistic regression brings to this problem.
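To make that reading concrete, here is a minimal sketch of turning logistic regression coefficients into odds ratios. The data is synthetic and the names X_demo, y_demo and logreg are made up for illustration; this is not the FOIA dataset or the model fitted earlier. The coefficients live on the log-odds scale, so exponentiating them gives the factor by which the odds change per one-unit increase.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption for illustration, NOT the FOIA data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "avg_sen_len": rng.normal(20, 5, 500),
    "word_count": rng.normal(150, 40, 500),
})

# Made-up rule: longer sentences make success more likely; word count is ignored.
log_odds = 0.15 * (X_demo["avg_sen_len"] - 20)
y_demo = (rng.random(500) < 1 / (1 + np.exp(-log_odds))).astype(int)

logreg = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# exp(coefficient) = multiplicative change in the odds per one-unit increase.
odds_ratios = pd.Series(np.exp(logreg.coef_[0]), index=X_demo.columns)
print(odds_ratios)
```

An odds ratio above 1 means each extra unit of that feature multiplies the odds of success by that factor, and a value below 1 means it shrinks them. That is the whole explanation, which is exactly why it is easy to read and also why it cannot capture anything more complicated.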

Our random forest performed very well, and can attempt to explain its predictions, but there’s a little gotcha hiding inside.

eli5.explain_prediction_df(forest, X.iloc[0])

feature	weight	value
<BIAS>	0.6958651	1.00000
avg_sen_len	0.0801930	21.55556
word_count	0.0742536	194.00000
ref_fees	0.0478529	1.00000
specificity	0.0355610	8.00000
ref_foia	0.0233319	0.00000
email_address	-0.0031451	0.00000
hyperlink	-0.0133064	0.00000

While it can tell you which features mattered for this decision, the advice isn’t as simple as “decrease your average sentence length,” because a forest’s prediction comes from the interactions between all of the features. For example, maybe using more words to get your point across is helpful, but if you add a hyperlink it’s better to be brief and let the URL do the explaining.
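The hypothetical above can be sketched with a toy forest. Everything here is an assumption for illustration: the data is synthetic, the rule "long requests help without a hyperlink and hurt with one" is invented to mirror the example, and demo_forest is not the forest trained earlier.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the FOIA data).
rng = np.random.default_rng(0)
n = 2000
features = ["word_count", "hyperlink"]
demo = pd.DataFrame({
    "word_count": rng.integers(50, 300, n),
    "hyperlink": rng.integers(0, 2, n),
})

# Invented rule: long requests succeed without a link, short ones succeed with a link.
long_request = (demo["word_count"] > 150).to_numpy()
granted = np.where(demo["hyperlink"] == 1, ~long_request, long_request)

# Flip 10% of the labels so the pattern is not perfectly clean.
noise = rng.random(n) < 0.1
y_demo = np.where(noise, ~granted, granted).astype(int)

demo_forest = RandomForestClassifier(n_estimators=100, random_state=0)
demo_forest.fit(demo[features], y_demo)

# The same change in word_count, with and without a hyperlink.
what_if = pd.DataFrame({
    "word_count": [80, 250, 80, 250],
    "hyperlink": [0, 0, 1, 1],
})
what_if["p_granted"] = demo_forest.predict_proba(what_if[features])[:, 1]
print(what_if)
```

In this toy setup, adding words raises the predicted probability when there is no hyperlink and lowers it when there is one, so no single "word_count is good" or "word_count is bad" statement describes the model.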

Let’s look at a few more explanations.

eli5.explain_prediction_df(forest, X.iloc[1])

target	feature	weight	value
1	avg_sen_len	0.3878024	9.5
1	<BIAS>	0.3041349	1.0
1	word_count	0.0876013	114.0
1	ref_fees	0.0084068	0.0
1	hyperlink	0.0080675	0.0
1	email_address	-0.0019088	0.0
1	ref_foia	-0.0057767	0.0
1	specificity	-0.0283274	32.0
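One thing that helps when reading these tables: for tree ensembles, eli5 documents the weights as per-feature contributions that, together with <BIAS>, add up to the model's output score. The sketch below only re-adds the numbers copied from the X.iloc[1] table above; it does not call the model.

```python
# Weights copied by hand from the X.iloc[1] explanation table.
weights = {
    "avg_sen_len": 0.3878024,
    "<BIAS>": 0.3041349,
    "word_count": 0.0876013,
    "ref_fees": 0.0084068,
    "hyperlink": 0.0080675,
    "email_address": -0.0019088,
    "ref_foia": -0.0057767,
    "specificity": -0.0283274,
}

# Bias plus all feature contributions should equal the predicted score.
predicted = sum(weights.values())
print(round(predicted, 4))
```

The total comes to 0.76, which is presumably the forest's predicted probability for target 1 on this row. Read this way, <BIAS> is the starting point before any feature is considered, and each weight is how far that feature pushed this particular prediction up or down.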

eli5.explain_prediction_df(forest, X.iloc[2])

target	feature	weight	value
1	<BIAS>	0.3041349	1.0
1	word_count	0.2173038	79.0
1	avg_sen_len	0.1467005	39.5
1	specificity	0.1088395	9.0
1	ref_fees	0.0026624	0.0
1	hyperlink	-0.0021454	1.0
1	email_address	-0.0028800	0.0
1	ref_foia	-0.0146157	0.0

It’s tough to explain that in a simple way! You might be able to guess and experiment based on which features are important, but there’s always a chance the interactions at play are ones you haven’t thought of.
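One way to do that experimenting is to take a single request, change one feature at a time, and ask the model to predict again. The sketch below is a hypothetical helper, sweep_feature, tried out on a synthetic forest where mid-length sentences were invented to be the sweet spot; neither the helper nor the data comes from the analysis above.

```python
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def sweep_feature(model, row, feature, values):
    """Re-predict one row while varying a single feature, holding the rest fixed."""
    values = list(values)
    # Repeat the row once per candidate value, then overwrite the one feature.
    variants = pd.DataFrame([row] * len(values)).reset_index(drop=True)
    variants[feature] = values
    probs = model.predict_proba(variants)[:, 1]
    return pd.Series(probs, index=values, name="p_granted")


# Synthetic stand-in data (NOT the FOIA data).
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "avg_sen_len": rng.uniform(5, 40, 1000),
    "word_count": rng.uniform(50, 300, 1000),
})
# Invented rule: only mid-length sentences succeed.
y_demo = ((X_demo["avg_sen_len"] > 15) & (X_demo["avg_sen_len"] < 30)).astype(int)

demo_forest = RandomForestClassifier(n_estimators=100, random_state=0)
demo_forest.fit(X_demo, y_demo)

curve = sweep_feature(demo_forest, X_demo.iloc[0], "avg_sen_len", [10, 22, 38])
print(curve)
```

In this toy case the probability peaks in the middle of the range, so neither "shorter" nor "longer" is the right advice on its own. A sweep like this only covers one feature for one row, though, so it can still miss interactions with features you did not think to vary.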