Read in our data#
Let's use pandas to read in our data file.
import pandas as pd
df = pd.read_csv("data/museums.csv")
df.head()
It already doesn't look that great! Let's spend some time cleaning this up.
Cleaning our columns#
Some column names have spaces, returns, or other 'weird' non-typable characters. For example, we might see 质量等 级
instead of 质量等级
(how it ends up displaying actually depends on our computer!).
We don't like that, so let's remove them.
df.columns = df.columns.str.replace('\r','')
df.head()
Copying headers down#
The first row has NaN
in it, which means missing data. NaN
is also called "missing", or "N/A" or "null".
If we want to see every row with missing data, we can run this code.
df[df.isnull().any(axis=1)]
It would be nice if those values were actually moved into another column. Let's look at them again, only looking at the values themselves.
df[df.isnull().any(axis=1)].博物馆名称
Let's copy that into a new column.
df['region'] = df[df.isnull().any(axis=1)].博物馆名称
df.head()
Now let's fill in all of those empty values.
df.region = df.region.fillna(method='ffill')
df.head()
Now let's drop everything with missing data.
df.dropna(inplace=True)
df.head()
Fixing up the text#
Maybe we can clean up the 'region' column, too, and remove the parentheses part.
df.region = df.region.str.replace('\(.*\)', '')
df.head()
Great!
Saving our data#
Let's save this into a new file. We need to use some specific quoting rules, or else it won't work! I'm not sure if this is due to it because Chinese or because some of these rows have multiple lines of data in them.
import csv
df.to_csv("data/museums-cleaned.csv", index=False, quoting=csv.QUOTE_ALL)