How to use NLP with scikit-learn vectorizers in Japanese, Chinese (and other East Asian languages) by using a custom tokenizer#

While it's easy to get scikit-learn to play nicely with Japanese, Chinese, and other East Asian languages, most documentation is based around processing English. In this section we'll use a few tricks to override sklearn's English-language focus.

The problem#

Working in English#

When you use scikit-learn to do text analysis, the very first step is usually splitting and counting words. Let's take a simple example of a few English sentences.

texts = [
    "Penny bought bright blue fishes.",
    "Penny bought bright blue and orange fish.",
    "The cat ate a fish at the store.",
    "Penny went to the store. Penny ate a bug. Penny saw a fish.",
    "Penny is a fish"
]

Now we'll use scikit-learn's CountVectorizer to count the words in each sentence.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

pd.set_option("display.max_columns", 30)
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(texts)

words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names())
words_df
and at ate blue bought bright bug cat fish fishes is orange penny saw store the to went
0 0 0 0 1 1 1 0 0 0 1 0 0 1 0 0 0 0 0
1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0
2 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 2 0 0
3 0 0 1 0 0 0 1 0 1 0 0 0 3 1 1 1 1 1
4 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0

Nice and easy, right? Scikit-learn's CountVectorizer does a few steps:

  1. Separates the words
  2. Makes them all lowercase
  3. Finds all the unique words
  4. Counts the unique words
  5. Throws us a little party and makes us very happy
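
If you want to see the first couple of steps in action, the vectorizer's build_analyzer() method hands back the lowercase-and-split function it uses internally. Here's a quick sketch - the output should look roughly like this:

# Peek under the hood: the analyzer lowercases each document and splits it into words
analyzer = vectorizer.build_analyzer()
analyzer("Penny bought bright blue fishes.")
['penny', 'bought', 'bright', 'blue', 'fishes']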

If you need a review of how all that works, I recommend checking out the advanced word counting and TF-IDF explanations.

The problem shows up, though, when we try to use Japanese.

Working in Japanese#

Let's try the same thing we did above, but using Japanese.

texts_jp = [
    "ペニーは鮮やかな青い魚を買った。",
    "ペニーは明るい青とオレンジの魚を買った。",
    "猫は店で魚を食べました。",
    "ペニーは店に行きました。ペニーは虫を食べました。ペニーは魚を見ました。",
    "ペニーは魚です"
]
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(texts_jp)

words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names())
words_df
ペニーは店に行きました ペニーは明るい青とオレンジの魚を買った ペニーは虫を食べました ペニーは魚です ペニーは魚を見ました ペニーは鮮やかな青い魚を買った 猫は店で魚を食べました
0 0 0 0 0 0 1 0
1 0 1 0 0 0 0 0
2 0 0 0 0 0 0 1
3 1 0 1 0 1 0 0
4 0 0 0 1 0 0 0

Oof, ouch, wow! That's terrible!

Because scikit-learn's vectorizer doesn't know how to split the Japanese sentences apart (also known as segmentation), it falls back on its default tokenizer, which only splits on spaces and punctuation. Since Japanese doesn't put spaces between words, each whole clause ends up being counted as a single "word"!
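
You can see this for yourself by asking a plain CountVectorizer for its default tokenizer and handing it one of our sentences - a quick sketch, with roughly the output you should see:

# The default tokenizer only splits on spaces and punctuation,
# so the whole Japanese clause comes back as one "word"
default_tokenize = CountVectorizer().build_tokenizer()
default_tokenize("ペニーは鮮やかな青い魚を買った。")
['ペニーは鮮やかな青い魚を買った']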

Segmenting in non-English languages#

There's another page where we learned to split words in East Asian languages, and it wasn't bad at all. Let's see how it works for an example in Japanese, using the nagisa library.

If you're interested in another language, keep reading! The same concepts apply to Chinese, Vietnamese, and other languages.

import nagisa

text = 'ペニーは鮮やかな青い魚を買った。'
doc = nagisa.tagging(text)

doc.words
['ペニー', 'は', '鮮やか', 'な', '青い', '魚', 'を', '買っ', 'た', '。']

While that's nice and fun and cool and wonderful, it doesn't actually help us with our machine learning. All of the machine learning on this site is based on scikit-learn, where the CountVectorizer or TfidfVectorizer splits the text for us, not some extra library.

So how do we teach scikit-learn to use nagisa?

Using custom text segmentation in scikit-learn#

We have a few options for teaching scikit-learn's vectorizers to segment Japanese, Chinese, or other East Asian languages. The easiest technique is to give the vectorizer a custom tokenizer.

Tokenization is the process of splitting a text apart into individual words. If we can replace the vectorizer's default English-language tokenizer with the nagisa tokenizer, we'll be all set!

The first thing we need to do is write a function that will tokenize a sentence. Since we'll be tokenizing Japanese, we'll call it tokenize_jp.

# Takes in a document, returns the list of words
def tokenize_jp(doc):
    doc = nagisa.tagging(doc)
    return doc.words
# Test it out

print(tokenize_jp("ペニーは鮮やかな青い魚を買った。"))
print(tokenize_jp("猫は店で魚を食べました。"))
['ペニー', 'は', '鮮やか', 'な', '青い', '魚', 'を', '買っ', 'た', '。']
['猫', 'は', '店', 'で', '魚', 'を', '食べ', 'まし', 'た', '。']

Now all we need to do is tell our vectorizer to use our custom tokenizer.

vectorizer = CountVectorizer(tokenizer=tokenize_jp)
matrix = vectorizer.fit_transform(texts_jp)

words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names())
words_df
。 た で です と な に の は まし ... 猫 虫 行き 見 買っ 青 青い 食べ 魚 鮮やか
0 1 1 0 0 0 1 0 0 1 0 ... 0 0 0 0 1 0 1 0 1 1
1 1 1 0 0 1 0 0 1 1 0 ... 0 0 0 0 1 1 0 0 1 0
2 1 1 1 0 0 0 0 0 1 1 ... 1 0 0 0 0 0 0 1 1 0
3 3 3 0 0 0 0 1 0 3 3 ... 0 1 1 1 0 0 0 1 1 0
4 0 0 0 1 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 1 0

5 rows × 25 columns

Data! It's like magic!

Since we're only overriding the tokenizer, we can also do things like use n-grams or custom stopword lists without any trouble.

stop_words = ['。', 'な', 'と', 'た', 'で', 'は']
vectorizer = CountVectorizer(tokenizer=tokenize_jp, ngram_range=(1,2), stop_words=stop_words)
matrix = vectorizer.fit_transform(texts_jp)

words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names())
words_df
です に に 行き の の 魚 まし まし ペニー を を 見 を 買っ ... 青 オレンジ 青い 青い 魚 食べ 食べ まし 魚 魚 です 魚 を 鮮やか 鮮やか 青い
0 0 0 0 0 0 0 0 1 0 1 ... 0 1 1 0 0 1 0 1 1 1
1 0 0 0 1 1 0 0 1 0 1 ... 1 0 0 0 0 1 0 1 0 0
2 0 0 0 0 0 1 0 1 0 0 ... 0 0 0 1 1 1 0 1 0 0
3 0 1 1 0 0 3 2 2 1 0 ... 0 0 0 1 1 1 0 1 0 0
4 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 1 0 0 0

5 rows × 44 columns

Customizing our Japanese tokenizer further#

Custom stopword lists are nice, but I don't want to type out things like 。 and は one by one. I just want to say "please don't include punctuation or particles." It turns out this is possible with nagisa, as in their example:

text = 'Pythonで簡単に使えるツールです'
# Filter the words of the specific POS tags.
words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
print(words)
#=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

We can do the same thing by adapting this code to our tokenizer.

# Takes in a document, filtering out particles, punctuation, and verb endings
def tokenize_jp(text):
    doc = nagisa.filter(text, filter_postags=['助詞', '補助記号', '助動詞'])
    return doc.words

vectorizer = CountVectorizer(tokenizer=tokenize_jp)
matrix = vectorizer.fit_transform(texts_jp)

words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names())
words_df
オレンジ ペニー 店 明るい 猫 虫 行き 見 買っ 青 青い 食べ 魚 鮮やか
0 0 1 0 0 0 0 0 0 1 0 1 0 1 1
1 1 1 0 1 0 0 0 0 1 1 0 0 1 0
2 0 0 1 0 1 0 0 0 0 0 0 1 1 0
3 0 3 1 0 0 1 1 1 0 0 0 1 1 0
4 0 1 0 0 0 0 0 0 0 0 0 0 1 0

Using a TF-IDF vectorizer with Chinese or Japanese#

For most vectorizing we're going to use a TfidfVectorizer instead of a CountVectorizer, and we override its tokenizer in exactly the same way we did for the CountVectorizer. This time, though, we'll be telling scikit-learn to use a Chinese tokenizer (jieba, see details here) instead of a Japanese one. We'll start by counting words so the output is easy to read, then swap in TF-IDF right after.

texts_zh = [
  '翠花买了浅蓝色的鱼',
  '翠花买了浅蓝橙色的鱼',
  '猫在商店吃了一条鱼',
  '翠花去了商店。翠花买了一只虫子。翠花看到一条鱼',
  '翠花是鱼'  
]
# Demo how jieba works
import jieba

jieba.lcut('翠花买了浅蓝色的鱼')
['翠花', '买', '了', '浅蓝色', '的', '鱼']

All we do is write a function that uses jieba as a custom tokenizer, and we're all set!

# Takes in a document, separates the words
def tokenize_zh(text):
    words = jieba.lcut(text)
    return words

# Add a custom list of stopwords for punctuation
stop_words = ['。', ',']

vectorizer = CountVectorizer(tokenizer=tokenize_zh, stop_words=stop_words)
matrix = vectorizer.fit_transform(texts_zh)

words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names())
words_df
一只 一条 买 了 去 吃 商店 在 是 橙色 浅蓝 浅蓝色 猫 的 看到 翠花 虫子 鱼
0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1
1 0 0 1 1 0 0 0 0 0 1 1 0 0 1 0 1 0 1
2 0 1 0 1 0 1 1 1 0 0 0 0 1 0 0 0 0 1
3 1 1 1 2 1 0 1 0 0 0 0 0 0 0 1 3 1 1
4 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1

There we go!
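
Swapping in TF-IDF is the exact same move: hand TfidfVectorizer the same tokenizer= (and the same stopword list). A minimal sketch - the columns stay the same, but the cells become weighted scores instead of raw counts:

from sklearn.feature_extraction.text import TfidfVectorizer

# Same custom tokenizer and stopwords - only the vectorizer class changes
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize_zh, stop_words=stop_words)
tfidf_matrix = tfidf_vectorizer.fit_transform(texts_zh)

tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=tfidf_vectorizer.get_feature_names())
tfidf_df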

What now?#

You can now do pretty much everything this site uses a vectorizer for, which is most of the natural language processing pieces. Just be careful to ignore anything we say about stemming or lemmatization - we have to write very, very custom vectorizers to handle English being a horrible language, but you can skip all of that and just use tokenizer=.

For example, in the topic modeling section we build a vectorizer that looks like this:

from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

# English stemmer from pyStemmer
stemmer = Stemmer.Stemmer('en')

analyzer = TfidfVectorizer().build_analyzer()

# Override TfidfVectorizer
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(TfidfVectorizer, self).build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

# Vectorize and count words
vectorizer = StemmedTfidfVectorizer(min_df=50)
matrix = vectorizer.fit_transform(recipes.ingredient_list)

# Get a nice readable dataframe of words
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names())
words_df.head()

We have to jump through all of those hoops because, by default, scikit-learn vectorizers don't stem: stemming is what turns swim, swims, and swimming all into swim. As a result we need to do a LOT of overriding.
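
If you haven't seen stemming before, here's a tiny illustration of what that pyStemmer stemmer does to related word forms:

import Stemmer

# The English stemmer chops related forms down to a shared root
stemmer = Stemmer.Stemmer('en')
stemmer.stemWords(['swim', 'swims', 'swimming'])
['swim', 'swim', 'swim']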

Languages like Chinese or Japanese don't need stemming, though! For example, if you were doing this in Japanese, you could skip all the complex parts and simply stick with a custom tokenizer:

from sklearn.feature_extraction.text import TfidfVectorizer
import nagisa

# Takes in a document, filtering out particles, punctuation, and verb endings
def tokenize_jp(text):
    doc = nagisa.filter(text, filter_postags=['助詞', '補助記号', '助動詞'])
    return doc.words

# Vectorize and count words (with a custom tokenizer)
vectorizer = TfidfVectorizer(tokenizer=tokenize_jp, min_df=50)
matrix = vectorizer.fit_transform(recipes.ingredient_list)

# Get a nice readable dataframe of words
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names())
words_df

Of course this won't work on that page because recipes.ingredient_list is in English, but hopefully you get the idea!
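
If you'd like something you can actually run, here's the same idea pointed at our texts_jp sentences from earlier - a minimal sketch, with min_df dropped since five tiny documents would leave nothing behind:

# TF-IDF with a Japanese tokenizer, using the small sample sentences from above
vectorizer = TfidfVectorizer(tokenizer=tokenize_jp)
matrix = vectorizer.fit_transform(texts_jp)

words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names())
words_df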

Review#

In this section, we learned how to use custom tokenizers to allow scikit-learn to play nicely with languages that don't use spaces to divide words. We specifically focused on building a Japanese vectorizer that used nagisa as well as a Chinese one that used jieba.

For more on Chinese TF-IDF specifically, check this page here. For segmenting words in other languages like Korean, Thai, or Vietnamese, visit our East Asian word splitting page.

Discussion topics#

This is actually not a discussion topic, but a request: if you find or make any NLP-based stories using non-English languages, please send them to me! The only one we have so far is this Caixin reproduction but I'd love to add more.