Cleaning up our Chinese museum dataset#

After exporting our Chinese museum dataset from Tabula, it still isn't perfect. Let's fix up a few issues with the column headers and add a new column.


Read in our data#

Let's use pandas to read in our data file.

import pandas as pd

df = pd.read_csv("data/museums.csv")
df.head()

	博物馆名称	博物馆性\r质	质量等\r级	是否免费\r开放	地址
0	北京市(151家)	NaN	NaN	NaN	NaN
1	故宫博物院	文物	一级	否	东城区景山前街4号
2	中国国家博物馆	文物	一级	是	北京市东城区东长安街16号
3	中国人民革命军事博物馆	行业	一级	是	海淀区复兴路9号
4	北京鲁迅博物馆(北京新文化运动\r纪念馆)	文物	一级	是	阜成门内宫门口二条19号\r东城区五四大街29号

It already doesn't look that great! Let's spend some time cleaning this up.

Cleaning our columns#

Some column names contain spaces, carriage returns, or other 'weird' non-typable characters. For example, we might see 质量等\r级 instead of 质量等级 (how it ends up displaying actually depends on our computer!).

We don't like that, so let's remove them.

df.columns = df.columns.str.replace('\r','')
df.head()

	博物馆名称	博物馆性质	质量等级	是否免费开放	地址
0	北京市(151家)	NaN	NaN	NaN	NaN
1	故宫博物院	文物	一级	否	东城区景山前街4号
2	中国国家博物馆	文物	一级	是	北京市东城区东长安街16号
3	中国人民革命军事博物馆	行业	一级	是	海淀区复兴路9号
4	北京鲁迅博物馆(北京新文化运动\r纪念馆)	文物	一级	是	阜成门内宫门口二条19号\r东城区五四大街29号
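
To see the same cleanup in isolation, here is a minimal sketch on a made-up one-row table. The `toy` frame and its values are hypothetical, not taken from the museum file. It also replaces the carriage return inside a cell value, which is an optional extra step the walkthrough above does not do.

```python
import pandas as pd

# Hypothetical toy frame: both headers and one cell contain a carriage return
toy = pd.DataFrame({
    "质量等\r级": ["一级"],
    "地\r址": ["宫门口二条19号\r五四大街29号"],
})

# Remove carriage returns from the column names (literal match, not a regex)
toy.columns = toy.columns.str.replace("\r", "", regex=False)

# Optional extra: swap the carriage return inside a text column for a space
toy["地址"] = toy["地址"].str.replace("\r", " ", regex=False)

print(list(toy.columns))
print(toy.loc[0, "地址"])
```

Passing `regex=False` just makes it explicit that we are matching a literal character rather than a pattern.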

Copying headers down#

The first row has NaN in it, which means missing data. NaN is also called "missing", or "N/A" or "null".

If we want to see every row with missing data, we can run this code.

df[df.isnull().any(axis=1)]

	博物馆名称	博物馆性质	质量等级	是否免费开放	地址
0	北京市(151家)	NaN	NaN	NaN	NaN
147	天津市(58家)	NaN	NaN	NaN	NaN
204	河北省(100家)	NaN	NaN	NaN	NaN
302	山西省(126家)	NaN	NaN	NaN	NaN
424	内蒙古自治区(198家)	NaN	NaN	NaN	NaN
617	辽宁省(97家)	NaN	NaN	NaN	NaN
711	吉林(107家)	NaN	NaN	NaN	NaN
816	黑龙江省(200家)	NaN	NaN	NaN	NaN
1010	上海市(119家)	NaN	NaN	NaN	NaN
1127	江苏省(284家)	NaN	NaN	NaN	NaN
1402	浙江省(286家)	NaN	NaN	NaN	NaN
1680	安徽省(189家)	NaN	NaN	NaN	NaN
1863	福建省(115家)	NaN	NaN	NaN	NaN
1975	江西省(141家)	NaN	NaN	NaN	NaN
2113	山东省(351家)	NaN	NaN	NaN	NaN
2453	河南省(274家)	NaN	NaN	NaN	NaN
2719	湖北省(204家)	NaN	NaN	NaN	NaN
2917	湖南省(134家)	NaN	NaN	NaN	NaN
3047	广东省(261家)	NaN	NaN	NaN	NaN
3300	广西壮族自治区(102家)	NaN	NaN	NaN	NaN
3400	海南省(25家)	NaN	NaN	NaN	NaN
3425	重庆市(72家)	NaN	NaN	NaN	NaN
3495	四川省(223家)	NaN	NaN	NaN	NaN
3712	贵州省(84家)	NaN	NaN	NaN	NaN
3794	云南省(105家)	NaN	NaN	NaN	NaN
3896	西藏自治区(8家)	NaN	NaN	NaN	NaN
3905	陕西省(244家)	NaN	NaN	NaN	NaN
4142	甘肃省(190家)	NaN	NaN	NaN	NaN
4326	青海省(33家)	NaN	NaN	NaN	NaN
4359	宁夏回族自治区(40家)	NaN	NaN	NaN	NaN
4399	新疆维吾尔自治区(105家)	NaN	NaN	NaN	NaN
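
If the `isnull().any(axis=1)` chain looks mysterious, this small sketch breaks it into steps. The `toy` frame and the `has_missing` name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the first row is a header row with a missing grade
toy = pd.DataFrame({
    "name": ["Region A (2 museums)", "Museum 1", "Museum 2"],
    "grade": [np.nan, "一级", "二级"],
})

# isnull() gives a True/False table the same shape as the data;
# any(axis=1) collapses each row into a single True/False flag
has_missing = toy.isnull().any(axis=1)
print(has_missing.tolist())

# Using the flags as a mask keeps only the rows with at least one missing value
print(toy[has_missing])
```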

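Those rows aren't museums at all: each one is a region header giving the region's name and its museum count. It would be nice if those values were actually moved into another column. Let's look at them again, only looking at the values themselves.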

df[df.isnull().any(axis=1)].博物馆名称

0            北京市(151家)
147           天津市(58家)
204          河北省(100家)
302          山西省(126家)
424       内蒙古自治区(198家)
617           辽宁省(97家)
711           吉林(107家)
816         黑龙江省(200家)
1010         上海市(119家)
1127         江苏省(284家)
1402         浙江省(286家)
1680         安徽省(189家)
1863         福建省(115家)
1975         江西省(141家)
2113         山东省(351家)
2453         河南省(274家)
2719         湖北省(204家)
2917         湖南省(134家)
3047         广东省(261家)
3300     广西壮族自治区(102家)
3400          海南省(25家)
3425          重庆市(72家)
3495         四川省(223家)
3712          贵州省(84家)
3794         云南省(105家)
3896         西藏自治区(8家)
3905         陕西省(244家)
4142         甘肃省(190家)
4326          青海省(33家)
4359      宁夏回族自治区(40家)
4399    新疆维吾尔自治区(105家)
Name: 博物馆名称, dtype: object

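Let's copy that into a new column. Because pandas lines the values up by index, only the header rows get a value and every other row gets NaN.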

df['region'] = df[df.isnull().any(axis=1)].博物馆名称
df.head()

	博物馆名称	博物馆性质	质量等级	是否免费开放	地址	region
0	北京市(151家)	NaN	NaN	NaN	NaN	北京市(151家)
1	故宫博物院	文物	一级	否	东城区景山前街4号	NaN
2	中国国家博物馆	文物	一级	是	北京市东城区东长安街16号	NaN
3	中国人民革命军事博物馆	行业	一级	是	海淀区复兴路9号	NaN
4	北京鲁迅博物馆(北京新文化运动\r纪念馆)	文物	一级	是	阜成门内宫门口二条19号\r东城区五四大街29号	NaN

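Now let's fill in all of those empty values by carrying each region name forward until the next one shows up.

# Forward fill: each NaN takes the most recent non-missing value above it
df.region = df.region.ffill()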
df.head()

	博物馆名称	博物馆性质	质量等级	是否免费开放	地址	region
0	北京市(151家)	NaN	NaN	NaN	NaN	北京市(151家)
1	故宫博物院	文物	一级	否	东城区景山前街4号	北京市(151家)
2	中国国家博物馆	文物	一级	是	北京市东城区东长安街16号	北京市(151家)
3	中国人民革命军事博物馆	行业	一级	是	海淀区复兴路9号	北京市(151家)
4	北京鲁迅博物馆(北京新文化运动\r纪念馆)	文物	一级	是	阜成门内宫门口二条19号\r东城区五四大街29号	北京市(151家)
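
The whole "copy the header down" pattern can be sketched on a toy table like this. The frame, the `is_header` name, and the use of `where` are illustrative choices, not the exact code above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: header rows have a name but no grade
toy = pd.DataFrame({
    "name": ["Region A", "Museum 1", "Museum 2", "Region B", "Museum 3"],
    "grade": [np.nan, "一级", "二级", np.nan, "一级"],
})

# Flag the header rows (the ones with a missing grade)
is_header = toy["grade"].isnull()

# Keep the name on header rows only; every other row becomes NaN
toy["region"] = toy["name"].where(is_header)

# Carry each region name down until the next header row
toy["region"] = toy["region"].ffill()

# The header rows have done their job, so keep only the museum rows
museums = toy[~is_header]
print(museums)
```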

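The only rows that still have missing data are the region header rows, and we don't need those any more. Let's drop everything with missing data.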

df.dropna(inplace=True)
df.head()

	博物馆名称	博物馆性质	质量等级	是否免费开放	地址	region
1	故宫博物院	文物	一级	否	东城区景山前街4号	北京市(151家)
2	中国国家博物馆	文物	一级	是	北京市东城区东长安街16号	北京市(151家)
3	中国人民革命军事博物馆	行业	一级	是	海淀区复兴路9号	北京市(151家)
4	北京鲁迅博物馆(北京新文化运动\r纪念馆)	文物	一级	是	阜成门内宫门口二条19号\r东城区五四大街29号	北京市(151家)
5	中国地质博物馆	行业	一级	否	西城区西四羊肉胡同15号	北京市(151家)

Fixing up the text#

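Maybe we can clean up the 'region' column, too, and remove the part in parentheses.

# regex=True is required in recent pandas, where str.replace matches literally by default
df.region = df.region.str.replace(r'\(.*\)', '', regex=True)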
df.head()

	博物馆名称	博物馆性质	质量等级	是否免费开放	地址	region
1	故宫博物院	文物	一级	否	东城区景山前街4号	北京市
2	中国国家博物馆	文物	一级	是	北京市东城区东长安街16号	北京市
3	中国人民革命军事博物馆	行业	一级	是	海淀区复兴路9号	北京市
4	北京鲁迅博物馆(北京新文化运动\r纪念馆)	文物	一级	是	阜成门内宫门口二条19号\r东城区五四大街29号	北京市
5	中国地质博物馆	行业	一级	否	西城区西四羊肉胡同15号	北京市

Great!
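
If we wanted to keep the museum count instead of throwing it away, one possible approach is `str.extract` with named groups. This is only a sketch: the `regions` series and the `name` and `count` column names are made up for illustration, and it assumes every value looks like `name(NNN家)` with ASCII parentheses.

```python
import pandas as pd

# Sample values in the same shape as the region header rows
regions = pd.Series(["北京市(151家)", "吉林(107家)", "西藏自治区(8家)"])

# Each named group becomes a column: the text before "(" and the digits before "家"
parts = regions.str.extract(r"^(?P<name>[^(]+)\((?P<count>\d+)家\)$")

# The extracted digits are still text, so convert them to integers
parts["count"] = parts["count"].astype(int)

print(parts)
```

Rows that don't match the pattern would come back as NaN, and the integer conversion would then fail, so it is worth checking for missing values first on real data.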

Saving our data#

Let's save this into a new file. We need to use some specific quoting rules, or else it won't work! I'm not sure whether that's because the text is Chinese or because some of these rows have multiple lines of data in them.

import csv

df.to_csv("data/museums-cleaned.csv", index=False, quoting=csv.QUOTE_ALL)
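
One way to check that the quoting does its job is to write a small table and read it straight back. This is a minimal sketch using an in-memory buffer instead of a real file. The `toy` frame is made up, with one value that contains a carriage return like the rows above.

```python
import csv
import io

import pandas as pd

# Hypothetical toy frame: the first name contains a carriage return
toy = pd.DataFrame({
    "name": ["北京鲁迅博物馆(北京新文化运动\r纪念馆)", "故宫博物院"],
    "region": ["北京市", "北京市"],
})

# Write to an in-memory buffer, wrapping every field in quotes
buf = io.StringIO()
toy.to_csv(buf, index=False, quoting=csv.QUOTE_ALL)

# Read it back and check that we still have two rows and two columns
back = pd.read_csv(io.StringIO(buf.getvalue()))
print(back.shape)
```

If the row count after reading matches the row count before writing, the multi-line values survived the trip.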
