Visualizing data with streamgraphs in Python#
Streamgraphs are a visualization technique that isn't used too often, but people get really excited when they see one. A streamgraph is a stacked area graph that's centered vertically, and it comes with a handful of pros and cons. Let's reproduce the streamgraphs from this Bloomberg piece.
Building the streamgraph#
The real fun part of the piece from Bloomberg is definitely the visualizations. Streamgraphs are when you take boring old stacked area charts and center them vertically!
Here's our target:
Let's do a little importing before we get too far in.
import pandas as pd
import matplotlib.pyplot as plt
# Make our graphics a little prettier
plt.style.use('ggplot')
Our data#
We're going to start with our labeled tweet dataset from the last section.
df = pd.read_csv("data/tweets-categorized.csv")
df.head(5)
Converting strings to dates#
We're going to be plotting these based on the week (or 8 days or 2 days or whatever), so we'll need to convert our date column to an actual date. Right now our "date" column is just a string.
df.dtypes
When your date is an object you usually have to wrangle it around with pd.to_datetime to make things work. Magically enough, you usually get to just say "hey, convert this to a datetime" and it works automatically.
# Convert the date column from strings to datetimes
df['date'] = pd.to_datetime(df.date)
df.head(2)
Did it really work?
df.dtypes
We now see that it's a datetime64[ns, UTC], which means we can do things like pull out the day of the year or the month or the week or all sorts of magic! Instead of that, though, we're going to group our data into chunks of several days each.
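If you're curious what pulling out those date parts looks like, here is a minimal sketch using the `.dt` accessor. The timestamps are made up for illustration and are not from the tweet dataset.

```python
import pandas as pd

# Hypothetical timestamps standing in for the tweet dates (not real data)
stamps = pd.Series(pd.to_datetime(
    ["2019-01-01 12:00", "2019-02-14 08:30", "2019-12-31 23:59"], utc=True
))

# Once the column is a real datetime, the .dt accessor exposes its parts
day_of_year = stamps.dt.dayofyear
month = stamps.dt.month
iso_week = stamps.dt.isocalendar().week

print(pd.DataFrame({"day_of_year": day_of_year, "month": month, "iso_week": iso_week}))
```

Note that December 31, 2019 falls in ISO week 1 of the following year, which is one reason grouping by fixed-size chunks of days can be less surprising than grouping by week number.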
Grouping by dates with resampling#
Grouping by time is called resampling, and it's remarkably easy! We're going to pull out Kamala Harris's tweets, and then tell pandas to resample by 8-day chunks using the date column.
Why 8? We need to be able to divide easily later, sorry.
# Resample and make it sum every 8 days
harris = df[df.username == 'KamalaHarris'].resample('8D', on='date').sum()
harris.head()
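To see what resampling does on a small scale, here is a sketch with a tiny made-up dataframe. The column names `economy` and `health` are hypothetical stand-ins for the topic columns; the real dataset may look different.

```python
import pandas as pd

# Four hypothetical tweets, each flagged 1 or 0 for two made-up topics
toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-05", "2019-01-09", "2019-01-10"]),
    "economy": [1, 0, 1, 1],
    "health": [0, 1, 0, 1],
})

# Bucket the rows into 8-day bins, starting from the first day, and add up each bin
binned = toy.resample("8D", on="date").sum()
print(binned)
```

The first two tweets land in the bin starting January 1 and the last two land in the bin starting January 9, so each row of the result is a per-topic count for one 8-day window.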
We can plot a normal stacked area chart with that...
ax = harris.plot(kind='area', stacked=True)
# Move the legend off of the chart
ax.legend(loc=(1.04,0))
Streamgraphs with resampled data#
But we came here for streamgraphs, right? Those magic centered ones? Pandas can't do those by itself, so we'll have to peer into the abyss of matplotlib directly.
fig, ax = plt.subplots(figsize=(10,5))
# Plot a stackplot - https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/stackplot_demo.html
ax.stackplot(harris.index, harris.T, baseline='wiggle', labels=harris.columns)
# Move the legend off of the chart
ax.legend(loc=(1.04,0))
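The centering comes entirely from the `baseline` argument. As a rough sketch on made-up numbers (not the tweet data), here is the same stack drawn with three of the baselines matplotlib's `stackplot` accepts, so you can compare a plain stacked area chart with the centered versions.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so this runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Three made-up layers over ten time steps
x = np.arange(10)
layers = np.array([
    [1, 2, 4, 3, 2, 1, 2, 3, 2, 1],
    [2, 2, 1, 3, 5, 4, 2, 1, 1, 2],
    [1, 3, 2, 2, 1, 3, 4, 2, 3, 1],
])

# 'zero' is the ordinary stacked area chart, 'sym' centers the stack around zero,
# and 'wiggle' shifts the baseline to reduce how much the layers slope
baselines = ["zero", "sym", "wiggle"]
fig, axes = plt.subplots(len(baselines), 1, figsize=(10, 8), sharex=True)

polys_per_baseline = {}
for ax, baseline in zip(axes, baselines):
    polys_per_baseline[baseline] = ax.stackplot(x, layers, baseline=baseline)
    ax.set_title(f"baseline='{baseline}'")

fig.tight_layout()
```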
You want to smooth out those tragically sharp points? Fine, but it involves inventing fake data.
Interpolating to smooth our streamgraph#
We can't just say, "draw smooth lines!" Matplotlib needs actual data. The chart looks jagged because right now we only have one data point every eight days, and each point can be a very sharp jump to the next.
harris.head()
What we're going to do is pretend we have data every two days. But instead of pretending, we're going to tell pandas to create this fake data.
First we'll make a list of all of the days we want to exist.
# Make a list of dates between the first and last
first = harris.index.min()
last = harris.index.max()
# Go between the first and the last in 2-day chunks
frequency = pd.date_range(start=first, end=last, freq='2D')
frequency[:10]
Now we're going to add them to our dataframe. The data will be missing when we add the new rows, because pandas sure isn't going to guess what should go there.
# Reindex our dataframe, adding a bunch of new days, but missing data!
smooth = harris.reindex(frequency)
smooth.head(6)
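Here is the same reindexing step on a tiny made-up dataframe, as a sketch of what to expect. The `economy` column and its values are hypothetical.

```python
import pandas as pd

# Three hypothetical 8-day totals
toy = pd.DataFrame(
    {"economy": [0.0, 8.0, 4.0]},
    index=pd.date_range("2019-01-01", periods=3, freq="8D"),
)

# Every second day between the first and last row
every_two_days = pd.date_range(toy.index.min(), toy.index.max(), freq="2D")

# Rows that already existed keep their values, new rows come in as NaN
expanded = toy.reindex(every_two_days)
print(expanded)
```

This also seems to be why a step that divides evenly matters: because 2 goes into 8, every original date lands on the new 2-day grid. With a step that didn't divide evenly, some original rows would not match any new date and would be dropped by `reindex`.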
Let's fill that data in through interpolation. We'll tell it to use quadratic interpolation to make it nice and smooth.
I'm putting this all in one cell so you can cut and paste more easily.
# Plan out 2-day chunks between the first and last days
first = harris.index.min()
last = harris.index.max()
frequency = pd.date_range(start=first, end=last, freq='2D')
# Inject the new (empty) rows, then interpolate new data
smoothed = harris.reindex(frequency).interpolate(method='quadratic')
smoothed.head()
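To check by hand what interpolation fills in, here is a small sketch on made-up numbers. It swaps in `method='linear'` rather than `'quadratic'`, only because linear results are easy to verify mentally and need no extra libraries (pandas relies on SciPy for quadratic interpolation). The `economy` column is hypothetical.

```python
import pandas as pd

# Three hypothetical 8-day totals
toy = pd.DataFrame(
    {"economy": [0.0, 8.0, 4.0]},
    index=pd.date_range("2019-01-01", periods=3, freq="8D"),
)
every_two_days = pd.date_range(toy.index.min(), toy.index.max(), freq="2D")

# Add the empty in-between rows, then draw straight lines between the known points
filled = toy.reindex(every_two_days).interpolate(method="linear")
print(filled)
```

The values climb evenly from 0 to 8 and then step down evenly to 4. Quadratic interpolation fills the same gaps with a curve instead of straight segments, which is what rounds off the corners in the chart.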
And now graph it!
fig, ax = plt.subplots(figsize=(10,5))
# Plot a stackplot - https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/stackplot_demo.html
ax.stackplot(smoothed.index, smoothed.T,
             baseline='wiggle', labels=smoothed.columns)
# Move the legend off of the chart
ax.legend(loc=(1.04,0))
# Set the title
ax.set_title("Kamala Harris twitter topics")
There we go, nice and smooth.
Review#
In this section we learned how to visualize categorized data over time using a streamgraph. We also learned how to interpolate between data points to smooth the chart out.
Discussion topics#
When we resample, we're talking about data points we don't have. Is that lying?
How would a bar graph be different than an area graph? If you'd like to see it yourself, try using stepwise interpolation up above.
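One possible way to try the stepwise version, sketched on made-up numbers: instead of interpolating, forward-fill each new row with the last known value. Using `ffill` here is an assumption about what "stepwise" should mean, and the `economy` column is hypothetical.

```python
import pandas as pd

# Three hypothetical 8-day totals
toy = pd.DataFrame(
    {"economy": [0.0, 8.0, 4.0]},
    index=pd.date_range("2019-01-01", periods=3, freq="8D"),
)
every_two_days = pd.date_range(toy.index.min(), toy.index.max(), freq="2D")

# Each in-between day repeats the most recent real value, which gives flat steps
stepped = toy.reindex(every_two_days).ffill()
print(stepped)
```

Passing a dataframe built this way to `ax.stackplot` in place of the interpolated one should give flat plateaus with abrupt jumps, which reads much more like a bar chart than a flowing stream.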