7.2 Sparse vs dense data
Why the big difference? Some of it has to do with the difference between sparse and dense data.
LinearSVC works really well with something called “sparse data,” which is data that has a lot of zeros in it. Let’s start with its opposite, “dense data.” Most of the time when we’re dealing with data, every single column in every single row has something in it.
For example, if we’re doing something about cars, every car might have a manufacturer, a model, a year it was made, a weight and an estimate of miles per gallon.
manufacturer | model | year | weight | mpg |
---|---|---|---|---|
Ford | F150 | 1980 | 2500 | 36 |
Ford | F150 | 1980 | 2500 | 36 |
Ford | F150 | 1980 | 2500 | 36 |
Ford | F150 | 1980 | 2500 | 36 |
This dataset is “dense” because almost everything is filled in, and there isn’t missing data or zeros. Sparse data, on the other hand, only has some of the columns filled in. For example, if we’re counting the words we find in a sentence, most sentences only have a few words in them.
Example sentence | suspect | victim | baseball | bat | shot | car | fled |
---|---|---|---|---|---|---|---|
blah blah | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
blah blah | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
blah blah | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
blah blah | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
We might be looking at a few hundred or thousand different words, but sentences aren’t that long, and most words will be marked as ‘0’ in the row. This is sparse data.
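To make the idea concrete, here is a minimal sketch that counts words by hand and checks how many cells end up as zero. The three sentences are made up for illustration (they are not from the real dataset), and the variable names are just placeholders. It also shows the trick that makes sparse data cheap to store: keep only the non-zero entries for each row.

```python
from collections import Counter

# Made-up example sentences (hypothetical, not from any real dataset)
sentences = [
    "the suspect fled in a car",
    "the victim was shot",
    "suspect hit the victim with a baseball bat",
]

# Every distinct word across all sentences becomes a column
vocabulary = sorted({word for s in sentences for word in s.split()})

# Dense form: one count per vocabulary word, zeros included
dense_rows = []
for s in sentences:
    counts = Counter(s.split())
    dense_rows.append([counts[word] for word in vocabulary])

# Sparse form: keep only the non-zero entries for each row
sparse_rows = [
    {word: n for word, n in zip(vocabulary, row) if n > 0}
    for row in dense_rows
]

total_cells = len(sentences) * len(vocabulary)
zero_cells = sum(row.count(0) for row in dense_rows)
print(f"{len(vocabulary)} columns, {zero_cells} of {total_cells} cells are zero")
print(sparse_rows[1])
```

Even with three short sentences, more than half of the cells are zero. With thousands of sentences and thousands of words, the share of zeros gets much larger.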
Support vector machines work well with sparse data, while random forests can be hit or miss. The reasons why one kind of classifier or another might work better have to do with the technical details of how the classifier works, which I’m actually not terribly worried about!
While that sounds like a terrible thing to say, it’s honestly more valuable to spend time thinking about what happens when our predictions are right or wrong: what it means if we misclassify, what it means if the LAPD misclassifies. That way, instead of just arbitrarily shooting for a certain number or accuracy with our machine learning, we can understand the tradeoffs that come with the fact that sometimes we will be wrong.
If you do remember the kinds of data different classifiers work better with, great work! You’ll be able to save a little bit of time. If not, no big deal, you’ll do the same thing many data science professionals do: semi-randomly try things and tweak options until you get a result you’re satisfied with. No harm in that!
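If you want to see what “trying things” can look like in practice, here is a rough sketch using scikit-learn. The sentences and labels are invented for illustration, and the choice of `cv=3` and `n_estimators=50` is arbitrary. With a dataset this tiny the scores don’t mean much; the point is only the pattern of running several classifiers on the same sparse matrix and comparing them.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Made-up sentences and labels, purely for illustration
# (1 = serious assault, 0 = not)
texts = [
    "suspect shot victim and fled in car",
    "suspect hit victim with baseball bat",
    "victim shot by suspect who fled",
    "suspect struck victim with bat and fled",
    "suspect shot at victim from car",
    "victim hit with bat by suspect",
    "suspect pushed victim during argument",
    "suspect yelled at victim and left",
    "victim slapped by suspect at party",
    "suspect shoved victim in parking lot",
    "suspect and victim argued then suspect left",
    "victim pushed by suspect during dispute",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# CountVectorizer returns a scipy sparse matrix of word counts
X = CountVectorizer().fit_transform(texts)

candidates = {
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score each classifier with 3-fold cross-validation on the same data
scores = {}
for name, clf in candidates.items():
    scores[name] = cross_val_score(clf, X, labels, cv=3).mean()
    print(name, round(scores[name], 2))
```

Whichever classifier comes out ahead, the accuracy number is only the starting point: the more important question is still what happens when the prediction is wrong.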