2.7 Alternative techniques

We have a problem, though: “for every 25 percentage point increase in minorities, it’s an additional 1 day of wait time” just isn’t very understandable. It doesn’t roll off the tongue, it doesn’t make sense very easily, and it’s going to be lost on a lot of your readers.

Even though linear regression is a nice advanced-ish method, that doesn’t mean it’s always the right one. Let’s try something easier:

df['majority_white'] = (df.pct_minority < 50).astype(int)
df.groupby('majority_white').wait_days.median()
## majority_white
## 0    4.208333
## 1    2.750000
## Name: wait_days, dtype: float64
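To see what that flag is doing, here is a minimal sketch on a made-up four-row table (the `toy` data and its numbers are invented for illustration, not taken from the real dataset). It shows that the comparison produces True/False, that `astype(int)` turns those into 1/0, and that a row sitting at exactly 50% lands in the "not majority white" group. It also adds a row count next to each median, which is one way to check how many addresses stand behind each number.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "pct_minority": [10.0, 49.9, 50.0, 95.0],
    "wait_days": [2.0, 3.0, 4.0, 6.0],
})

# The comparison yields True/False; astype(int) converts that to 1/0.
# Note that exactly 50.0 is NOT less than 50, so it gets a 0.
toy["majority_white"] = (toy.pct_minority < 50).astype(int)

# Median wait plus the number of rows in each group
summary = toy.groupby("majority_white").wait_days.agg(["median", "count"])
print(toy)
print(summary)
```

Whether a 50/50 tract should count as majority white is a judgment call; the `< 50` rule above is simply the one used in this section.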

2.7.1 Binning

Majority white vs. majority minority is easy to understand, but we can also break things down into a few more categories. It isn’t quite as simple as splitting into two groups, but it’s a little more nuanced while still being understandable. This is called binning.

In the example below, we’ll cut them into brackets of 20 percentage points:

  • 0-20% minority
  • 20-40% minority
  • 40-60% minority
  • 60-80% minority
  • and 80-100% minority

bins = range(0, 101, 20)
df['bin'] = pd.cut(df.pct_minority, bins)
df.head()
##                address         GEOID      Geo_FIPS  pct_white  pct_minority  \
## 0       3839 N 10TH ST  5.507900e+10  5.507900e+10   2.405063     97.594937   
## 1    4900 W MELVINA ST  5.507900e+10  5.507900e+10   8.824796     91.175204   
## 2  2400 W WISCONSIN AV  5.507901e+10  5.507901e+10  40.313725     59.686275   
## 3    1800 W HAMPTON AV  5.507900e+10  5.507900e+10   4.389407     95.610593   
## 4       4718 N 19TH ST  5.507900e+10  5.507900e+10   4.389407     95.610593   
## 
##    wait_days  majority_white        bin  
## 0   1.250000               0  (80, 100]  
## 1   8.833333               0  (80, 100]  
## 2   9.750000               0   (40, 60]  
## 3   2.416667               0  (80, 100]  
## 4  17.416667               0  (80, 100]

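It seems like we’d use range(0, 100, 20), but nope! range stops before its end value, so range(0, 100, 20) ends at 80 and the 80-100% bracket would go missing. Always go one past the final number to make sure your range includes it. Now we can group by the bin and see how a gradual increase in the minority share affects the wait days.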
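If the off-by-one feels abstract, this small sketch shows it directly. The three values in `values` are made up for illustration; the point is only what happens to a number above 80 when the last edge is missing.

```python
import pandas as pd

# range() never includes its stop value
short_edges = list(range(0, 100, 20))  # stops at 80
full_edges = list(range(0, 101, 20))   # reaches 100

print(short_edges)
print(full_edges)

# Toy percentages, invented for illustration
values = pd.Series([5.0, 59.7, 97.6])

# With the short edges there is no bin for 97.6, so it becomes NaN
too_short = pd.cut(values, short_edges)
complete = pd.cut(values, full_edges)

print(too_short.isna().sum())  # rows that fell outside every bin
print(complete.isna().sum())
```

The missing values are silent: pandas does not raise an error, and a later groupby would just leave those rows out.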

df.groupby('bin').wait_days.median()
## bin
## (0, 20]      2.208333
## (20, 40]     2.916667
## (40, 60]     3.270833
## (60, 80]     4.291667
## (80, 100]    4.250000
## Name: wait_days, dtype: float64
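Two optional tweaks may be worth knowing about, sketched here on made-up data (the `toy` table and the label strings are assumptions for illustration). By default `pd.cut` builds bins that are open on the left, so a value of exactly 0 would not fit in (0, 20]; `include_lowest=True` pulls it into the first bin. The `labels` argument swaps the interval notation for names that are friendlier to readers.

```python
import pandas as pd

edges = list(range(0, 101, 20))
# Hypothetical reader-friendly names, one per bin
labels = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "pct_minority": [0.0, 15.0, 35.0, 59.7, 75.0, 97.6],
    "wait_days": [2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
})

# Default behavior: exactly 0 falls outside (0, 20] and becomes NaN
default_bin = pd.cut(toy.pct_minority, edges)
print(default_bin.isna().sum())

# include_lowest=True keeps the 0 row; labels renames the bins
toy["bin"] = pd.cut(toy.pct_minority, edges, labels=labels, include_lowest=True)

medians = toy.groupby("bin", observed=False).wait_days.median()
print(medians)
```

Whether the real data contains any rows at exactly 0% minority is not something this section shows, so treat `include_lowest` as a precaution rather than a required fix.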

Way more interesting, right? And much easier to communicate to your readers, to boot.