Counting museums in China#
In this section we'll reproduce the graphics from a Caixin piece that does a per-capita analysis of the museums in China.
Setup#
We'll import pandas as usual, but we'll also need to do some special matplotlib setup. This allows us to have graphs with Chinese characters - if we don't, every time we graph we'll get a lot of errors and the text won't look right.
import pandas as pd
import matplotlib.pyplot as plt
# So Chinese characters can appear correctly
plt.rcParams['font.sans-serif'] = ['SimHei', 'SimSun', 'Microsoft YaHei New', 'Microsoft YaHei', 'Arial Unicode MS']
%matplotlib inline
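If you aren't sure whether any of those fonts exist on your machine, something like the following can tell you. This is a rough sketch, not part of the original walkthrough: it assumes matplotlib's `font_manager` has already scanned your system fonts, and the `preferred` list is just a copy of the one above.

```python
from matplotlib import font_manager

# The same preference list used in the setup code above
preferred = ['SimHei', 'SimSun', 'Microsoft YaHei New', 'Microsoft YaHei', 'Arial Unicode MS']

# Names of every font matplotlib has discovered on this machine
available = {font.name for font in font_manager.fontManager.ttflist}

# Keep only the preferred fonts that are actually installed, in order
usable = [name for name in preferred if name in available]
print("Installed fonts from the list:", usable or "none, Chinese labels may not render")
```

If the list comes back empty, you'll probably need to install one of those fonts (or add one you do have) before the charts look right.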
Importing our data#
We'll be using the data we previously cleaned.
df = pd.read_csv("data/museums-cleaned.csv")
df.head()
How many museums do we have?
df.shape
Great. 4469 rows, 6 columns. That's 4469 museums, with 6 pieces of data about each.
Counting values#
How many museums are in each province?
df.region.value_counts()
Honestly, we use value_counts() to count everything in almost every column.
df.博物馆性质.value_counts()
df.质量等级.value_counts()
df.是否免费开放.value_counts()
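Two options on value_counts() are worth knowing about when you're counting everything: `normalize=True` for shares instead of counts, and `dropna=False` to see missing values. Here's a small sketch on made-up data (the `toy` series is invented for illustration and is not the museum dataset).

```python
import pandas as pd

# Made-up values standing in for a column like 是否免费开放
toy = pd.Series(['yes', 'yes', 'no', 'yes', None], name='free')

# Plain counts: missing values are skipped by default
counts = toy.value_counts()

# Shares of the non-missing rows instead of raw counts
shares = toy.value_counts(normalize=True)

# Include missing values as their own category
with_missing = toy.value_counts(dropna=False)

print(counts)
print(shares)
print(with_missing)
```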
Crosstab for combinations#
What if we want to count how often each combination of two columns shows up? pd.crosstab to the rescue!
pd.crosstab(df.博物馆性质, df.是否免费开放)
pd.crosstab(df.region, df.是否免费开放)
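If you also want row and column totals alongside the combinations, crosstab accepts `margins=True`. A minimal sketch on made-up data (the `toy` frame and its `kind` and `free` columns are invented stand-ins, not the real columns):

```python
import pandas as pd

# Made-up stand-ins for museum type and free admission
toy = pd.DataFrame({
    'kind': ['state', 'state', 'private', 'state', 'private'],
    'free': ['yes', 'no', 'no', 'yes', 'yes'],
})

# margins=True adds an "All" row and column holding the totals
table = pd.crosstab(toy.kind, toy.free, margins=True)
print(table)
```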
Percentage crosstabs#
Instead of pure counts, sometimes you want crosstab to return percentages. In this case, we'll pass normalize='index' so each cell becomes a fraction of its row's total (0.25 means 25% of that row).
pd.crosstab(df.region, df.博物馆性质, normalize='index')
We can even graph it...
pd.crosstab(df.region, df.博物馆性质, normalize='index').plot(kind='barh', figsize=(5,10))
...but it looks much better stacked! We can do this because each row adds up to 100%.
pd.crosstab(df.region, df.博物馆性质, normalize='index').plot(kind='barh', figsize=(5,10), stacked=True)
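To convince yourself that every row really does add up to 100%, you can sum across the columns. This is a small sketch on made-up data (the `toy` frame is invented for illustration):

```python
import pandas as pd

# Made-up stand-ins for museum type and free admission
toy = pd.DataFrame({
    'kind': ['state', 'state', 'private', 'state', 'private'],
    'free': ['yes', 'no', 'no', 'yes', 'yes'],
})

# Each cell is a fraction of its row's total
shares = pd.crosstab(toy.kind, toy.free, normalize='index')

# Summing across the columns should give 1.0 for every row
row_totals = shares.sum(axis=1)
print(shares)
print(row_totals)
```

If you used normalize='columns' instead, the columns would add up to 1 and the rows would not, so a stacked bar per row would no longer fill the full width.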
Per capita adjustments#
To judge the number of museums per person (or the number of people per museum), we'll need to combine the province counts with population counts.
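# Name the index explicitly so the reset column is called 'index' on both older and newer pandas versions
regions = df.region.value_counts().rename_axis('index').reset_index(name='museums')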
regions.head()
This dataset has been cleaned a little bit to make sure the columns match.
pop = pd.read_csv("data/population-cleaned.csv")
pop.head()
We'll merge on the Chinese name for each region.
merged = regions.merge(pop, right_on='Chinese', left_on='index')
merged
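A default merge silently drops any region whose name doesn't match on both sides, so it can be worth checking for stragglers. One way to do that, sketched here on made-up data (the region names and numbers are placeholders, not real provinces), is an outer merge with `indicator=True`:

```python
import pandas as pd

# Placeholder data: one name on each side has no partner
regions = pd.DataFrame({'index': ['甲省', '乙省', '丙省'], 'museums': [120, 40, 75]})
pop = pd.DataFrame({'Chinese': ['甲省', '乙省', '丁省'],
                    'Population': [50_000_000, 6_000_000, 30_000_000]})

# how='outer' keeps every row; indicator=True adds a _merge column
# saying whether each row came from the left, the right, or both
check = regions.merge(pop, left_on='index', right_on='Chinese',
                      how='outer', indicator=True)

# Anything that isn't "both" failed to match
unmatched = check[check['_merge'] != 'both']
print(unmatched)
```

If `unmatched` is empty on the real data, the names line up and the plain merge is safe.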
And then perform some small calculations to build two new columns relating the number of people to the number of museums.
merged['people_per_museum'] = merged.Population / merged.museums
merged['museums_per_1m'] = merged.museums / merged.Population * 1000000
merged.head()
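The two new columns are really the same information flipped over: multiply them together and you always get 1,000,000. A quick sketch with invented numbers (the `toy` regions A, B and C are made up) shows that, and shows that sorting by one is the reverse of sorting by the other:

```python
import pandas as pd

# Invented regions and numbers, only for illustration
toy = pd.DataFrame({
    'region': ['A', 'B', 'C'],
    'museums': [200, 40, 100],
    'Population': [80_000_000, 6_000_000, 20_000_000],
})

# Same calculations as above
toy['people_per_museum'] = toy.Population / toy.museums
toy['museums_per_1m'] = toy.museums / toy.Population * 1_000_000

# The two measures are reciprocals, scaled by one million
product = toy.people_per_museum * toy.museums_per_1m

# Fewest people per museum first...
by_people = list(toy.sort_values('people_per_museum').region)
# ...matches most museums per million first
by_museums = list(toy.sort_values('museums_per_1m', ascending=False).region)

print(toy)
print(by_people, by_museums)
```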
Viewing our results#
We only looked at the first five above because we're probably interested in the sorted version.
merged.sort_values(by='museums_per_1m', ascending=False)
merged.sort_values(by='museums_per_1m').plot(x='Chinese', y='museums_per_1m', kind='barh', figsize=(5,10))
The actual Caixin piece graphic gives you the people per museum, and draws a line at 300,000 people.
We can easily reproduce that one with matplotlib. If your characters are showing up weird, make sure you ran the font-setting code up at the top!
merged.sort_values(by='people_per_museum', ascending=False).plot(x='Chinese', y='people_per_museum', kind='bar', figsize=(10, 5), color='#8b70b1')
plt.axhline(300000, color='black')
Discussion topics#
Why do we look at per capita museums in each province instead of the raw numbers?
When you talk about a "bigger" province, you could be talking about either its population or how physically large the area is. Why does per capita make more sense here?
宁夏回族自治区 and 青海省 have a large number of museums per capita, but not very many museums overall (around 40, compared to 100-250 in the other high per-capita regions). Does it seem reasonable that they're listed between places like 陕西省 and 陕西省 which both have over 200 museums each?
We calculated two numbers for this data - people per museum, and museums per person (scaled to museums per million people). What are the different feelings associated with each angle? How would the chart look different if it were presented as museums per person instead of people per museum?