Segmenting and counting words in Chinese museum names#
Let's reproduce part of this piece from Caixin, where the names of Chinese museums are segmented out and the most popular words are counted.
# Use pip to install jieba, a Chinese text-segmentation library
#!pip install jieba
import pandas as pd
import matplotlib.pyplot as plt
import re
# So Chinese characters can appear correctly
plt.rcParams['font.sans-serif'] = ['SimHei', 'SimSun', 'Microsoft YaHei New', 'Microsoft YaHei', 'Arial Unicode MS']
Our target#
We'll be aiming to reproduce this graphic specifically - we won't use a word cloud, but we'll try to get counts for each word used in the museum names.
Our dataset#
We'll start off by reading in our cleaned dataset of museums. Nothing too crazy going on yet!
df = pd.read_csv("data/museums-cleaned.csv")
df.head()
And how many museums are in our dataset?
df.shape
While we have a handful of columns here, we're only interested in 博物馆名称, the museum name.
Cleaning museum names#
If we wanted to analyze each museum separately we could start from here, but we want to talk about all of the words used in all of the museum names. To do that, we'll need all of our names in one string.
Let's combine all of the museum names, putting spaces in between each name.
museum_names = ' '.join(df.博物馆名称)
museum_names[:400]
Notice that there are some characters in there that are not word-related - things like (, ), and \r.
To remove all non-word-related characters, we'll use a regular expression. Our regex will be [^\w], which means "match everything that is not a word character." Since Python's re module is Unicode-aware, \w matches Chinese characters too, so the museum names themselves will be left alone.
# Replace anything that isn't a word character with a space
museum_names = re.sub(r'[^\w]', ' ', museum_names)
museum_names[:400]
That looks a lot better!
Extracting words with jieba#
Now we'll need to split these museum names into separate words. For example, we'll want to separate 中国国家博物馆 into 中国, 国家, and 博物馆.
Because we can't split on spaces like we could with English, we'll need to use the jieba package to cut the text into separate words.
import jieba
jieba.lcut('中国国家博物馆')
How does that make you feel? Do you want to break it down a little further? We can do that by adding another parameter.
jieba.lcut('中国国家博物馆', cut_all=True)
The second version is what jieba calls "full mode": it lists every word it can possibly find, which gives a lot more options for matching (the same idea search engines rely on). I'm going to go with the second one because I think it'll make our results more interesting.
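For the record, jieba also has a separate, dedicated search-engine mode, lcut_for_search, which re-cuts the long words from the normal result. We won't use it below, but here's what it looks like on the same name, just for comparison.
# jieba's dedicated search-engine mode, shown only for comparison
jieba.lcut_for_search('中国国家博物馆')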
# Split the words
words = jieba.lcut(museum_names, cut_all=True)
# Let's look at the first thirty
print(words[:30])
Those empty strings - the '' - aren't words! Let's get rid of them.
# Filter out all of the ''
# We can't use .remove because it only removes one
words = [word for word in words if word != '']
# Check the first thirty again
print(words[:30])
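If that comment about .remove seems mysterious: .remove only deletes the first match it finds, which is why a list comprehension is the better tool here. A tiny demo with a made-up list:
# .remove only drops the first matching element, not all of them
demo = ['中国', '', '博物馆', '']
demo.remove('')
print(demo)  # one empty string is still hanging around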
Counting the words#
Counting words is simple - we just use Python's Counter. It'll count Chinese words the same as it would anything else.
from collections import Counter
counts = Counter(words)
counts.most_common(20)
We can also toss the counts into a dataframe, which will make sorting and graphing easy.
# Create the dataframe
counts_df = pd.DataFrame({
    'count': counts
})
# Pull a random 10 out
counts_df.sample(10)
counts_df.sort_values(by='count', ascending=False).head(10).plot(kind='barh')
Cleaning up the output#
I don't like the fact that 博物 and 博物馆 are both in there, and the same thing with 纪念 and 纪念馆. This is because we used cut_all=True, so we could make it stop by just... not doing that.
Instead, let's make things a little more complicated: let's create a list of words we don't like and make them all excluded.
# This includes empty string and a space (third line)
stopwords = """
纪念
博物
 
""".split("\n")
print(stopwords)
words = jieba.lcut(museum_names, cut_all=True)
words = [word for word in words if word not in stopwords]
counts = Counter(words)
counts.most_common(20)
If we didn't want to do that, we could also get rid of the cut_all=True to have bigger segments with no repetition.
words = jieba.lcut(museum_names)
words = [word for word in words if word not in stopwords]
counts = Counter(words)
counts.most_common(20)
Review#
In this section we looked at a real-life example of using word segmentation and counting with jieba, reproducing a piece from Caixin. We played around with two different ways of segmenting text using jieba - cut_all being True or not - and saw that while we get a lot more results with cut_all turned on, it might not be appropriate for all situations.
Discussion topics#
We relied on jieba for word segmentation, but it didn't necessarily give the expected results. For example, 中国国家博物馆 probably makes the most sense as 中国/国家/博物馆, but the only options with jieba were 中国/国家博物馆 or 中国/国家/博物/博物馆. Do we feel comfortable with output that isn't necessarily what a human being would pick?
We used a list of stopwords at the end to prevent words we "didn't like" from showing up in the results. Is this unfair to the computer's work, or biasing our results?
One way to do this without jieba would be to print out the list of museum names and manually add spaces everywhere we think makes sense. Then we could use museum_names.split(' ') to separate them. This approach would probably take about one second per name, plus breaks for being tired. With around 4,500 museums, we'd finish in under two hours. Would it be worth it for a more accurate word segmentation?
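For what it's worth, here's the arithmetic behind that estimate, assuming roughly one second per name:
# Back-of-the-envelope: ~4,500 names at about one second each
seconds = 4500 * 1
print(seconds / 60 / 60, "hours")  # about 1.25 hours of typing, before breaks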