Word-splitting and text segmentation in East Asian languages#

As different as they are, Chinese, Japanese, and Korean get lumped together as "CJK languages" when discussed from an English-language point of view. One reason they're considered similar is that they don't use spacing the way English does. While analyzing English starts with splitting sentences on spaces, a major difficulty for this language set (and others) is determining where the breaks between words are. In this section we'll look at how to use libraries to segment words in these languages, as well as in Thai and Vietnamese.

Using libraries#

Segmenting is the process of splitting a text into separate words. There isn't always a "right" answer as to where the splits should go, so you might have to try a few different libraries before you find a good fit. The recommendations below aren't necessarily the best Python packages, they're just ones that had a bit of activity, seemingly decent interfaces and documentation, and no external installs.

Using these libraries with scikit-learn#

After reading this page, you might want to learn how to use these libraries with scikit-learn vectorizers. In that case, check out the tutorial on how to make scikit-learn vectorizers work with Japanese, Chinese, and other East Asian languages once you're done here.
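As a quick preview, the basic trick is handing the vectorizer one of these segmentation functions as its tokenizer. Here's a minimal sketch, assuming you have scikit-learn installed along with jieba (introduced below); the linked tutorial covers the details.

from sklearn.feature_extraction.text import CountVectorizer
import jieba

# Use jieba's segmenter instead of the default space-based tokenizer
vectorizer = CountVectorizer(tokenizer=jieba.lcut)
matrix = vectorizer.fit_transform(['我来到北京清华大学'])
# use get_feature_names() on older versions of scikit-learn
print(vectorizer.get_feature_names_out())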

Chinese: jieba#

The word segmentation library jieba is very popular for analyzing Chinese text.

#!pip install jieba

Jieba has a few different techniques we can use to segment words. The documentation gets into the details, but one major question is whether we want the smallest possible divisions.

Using lcut gives us individual words and what might be considered noun phrases.

import jieba

jieba.lcut('我来到北京清华大学')
['我', '来到', '北京', '清华大学']

If we want to divide things up a bit more, we can add cut_all=True.

jieba.lcut('我来到北京清华大学', cut_all=True)
['我', '来到', '北京', '清华', '清华大学', '华大', '大学']

The big difference is that cut_all returns every word it can find, splitting something like 清华大学 into overlapping pieces such as 清华, 华大, and 大学 in addition to the full name.

When might we use one compared to the other?
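A rough rule of thumb: the default mode works well when you want each word counted once, while cut_all=True can be useful for search-style matching, since the overlapping pieces give partial terms a chance to match. A minimal sketch using the sentence from above:

# 清华 only shows up as its own token in the cut_all version;
# the default mode keeps the full name 清华大学 instead
print('清华' in jieba.lcut('我来到北京清华大学', cut_all=True))
print('清华' in jieba.lcut('我来到北京清华大学'))
True
False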

Japanese: nagisa#

For Japanese, you'll be using the library nagisa.

#!pip install nagisa
import nagisa

text = 'Pythonで簡単に使えるツールです'
doc = nagisa.tagging(text)

doc.words
['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

In addition to simple tokenization, nagisa will also do part-of-speech tagging.

doc.postags
['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

This allows you to do things like pluck out all of the nouns.

nouns = nagisa.extract(text, extract_postags=['名詞'])
nouns.words
['Python', 'ツール']
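If you'd like the words and their tags paired up, a quick zip does the trick (a minimal sketch using the doc object from above).

list(zip(doc.words, doc.postags))
[('Python', '名詞'), ('で', '助詞'), ('簡単', '形状詞'), ('に', '助動詞'), ('使える', '動詞'), ('ツール', '名詞'), ('です', '助動詞')]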

Korean: KoNLPy#

Korean does use spaces, but particles and endings attach directly to the words they modify, so you can't just split on spaces. KoNLPy gives you several different engines that can help with this; its documentation has a comparison chart of the engines along with more specific details about each one.

#!pip install konlpy

phrase = "아버지가방에들어가신다"

from konlpy.tag import Hannanum
hannanum = Hannanum()
hannanum.morphs(phrase)
['아버지가방에들어가', '이', '시ㄴ다']

from konlpy.tag import Kkma
kkma = Kkma()
kkma.morphs(phrase)
['아버지', '가방', '에', '들어가', '시', 'ㄴ다']

from konlpy.tag import Komoran
komoran = Komoran()
komoran.morphs(phrase)
['아버지', '가방', '에', '들어가', '시', 'ㄴ다']
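Each engine also has a pos method if you want part-of-speech tags attached to the morphemes. A minimal sketch (the tag sets differ from engine to engine, so check the comparison chart before leaning on them):

# Returns (morpheme, tag) pairs instead of bare morphemes
kkma.pos(phrase)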

Thai: tltk#

There are many Thai NLP packages out there, but we'll focus on tltk. It doesn't have the best documentation and it might not be the most accurate, but it doesn't require us to install anything extra (e.g. TensorFlow), and that's absolutely the only reason we're using it.

#!pip install tltk
import tltk

phrase = """สำนักงานเขตจตุจักรชี้แจงว่า ได้นำป้ายประกาศเตือนปลิงไปปักตามแหล่งน้ำ 
ในเขตอำเภอเมือง จังหวัดอ่างทอง หลังจากนายสุกิจ อายุ 65 ปี ถูกปลิงกัดแล้วไม่ได้ไปพบแพทย์"""

pieces = tltk.nlp.pos_tag(phrase)
pieces
[[('สำนักงาน', 'NOUN'),
  ('เขต', 'NOUN'),
  ('จตุจักร', 'PROPN'),
  ('ชี้แจง', 'VERB'),
  ('ว่า', 'SCONJ'),
  ('<s/>', 'PUNCT')],
 [('ได้', 'AUX'),
  ('นำ', 'VERB'),
  ('ป้ายประกาศ', 'NOUN'),
  ('เตือน', 'VERB'),
  ('ปลิง', 'NOUN'),
  ('ไป', 'VERB'),
  ('ปัก', 'VERB'),
  ('ตาม', 'ADP'),
  ('แหล่งน้ำ', 'NOUN'),
  (' \n', 'NOUN'),
  ('ใน', 'ADP'),
  ('เขต', 'NOUN'),
  ('อำเภอ', 'NOUN'),
  ('เมือง', 'NOUN'),
  ('<s/>', 'PUNCT')],
 [('จังหวัด', 'NOUN'),
  ('อ่างทอง', 'PROPN'),
  ('<s/>', 'PUNCT'),
  ('หลังจาก', 'SCONJ'),
  ('นาย', 'NOUN'),
  ('สุ', 'PROPN'),
  ('กิจ', 'NOUN'),
  ('<s/>', 'PUNCT')],
 [('อายุ', 'NOUN'), ('<s/>', 'PUNCT')],
 [('65 ', 'NUM'), ('ปี', 'NOUN'), ('<s/>', 'PUNCT')],
 [('ถูก', 'AUX'),
  ('ปลิง', 'VERB'),
  ('กัด', 'VERB'),
  ('แล้ว', 'ADV'),
  ('ไม่ได้', 'AUX'),
  ('ไป', 'VERB'),
  ('พบ', 'VERB'),
  ('แพทย์', 'NOUN'),
  ('<s/>', 'PUNCT')]]

tltk returns a list of sentences, with each sentence being a list of (word, part of speech) pairs. If you just want everything in one flat list, you'll need to jump through a tiny hoop.

words = [word for piece in pieces for word in piece]
print(words)
[('สำนักงาน', 'NOUN'), ('เขต', 'NOUN'), ('จตุจักร', 'PROPN'), ('ชี้แจง', 'VERB'), ('ว่า', 'SCONJ'), ('<s/>', 'PUNCT'), ('ได้', 'AUX'), ('นำ', 'VERB'), ('ป้ายประกาศ', 'NOUN'), ('เตือน', 'VERB'), ('ปลิง', 'NOUN'), ('ไป', 'VERB'), ('ปัก', 'VERB'), ('ตาม', 'ADP'), ('แหล่งน้ำ', 'NOUN'), (' \n', 'NOUN'), ('ใน', 'ADP'), ('เขต', 'NOUN'), ('อำเภอ', 'NOUN'), ('เมือง', 'NOUN'), ('<s/>', 'PUNCT'), ('จังหวัด', 'NOUN'), ('อ่างทอง', 'PROPN'), ('<s/>', 'PUNCT'), ('หลังจาก', 'SCONJ'), ('นาย', 'NOUN'), ('สุ', 'PROPN'), ('กิจ', 'NOUN'), ('<s/>', 'PUNCT'), ('อายุ', 'NOUN'), ('<s/>', 'PUNCT'), ('65 ', 'NUM'), ('ปี', 'NOUN'), ('<s/>', 'PUNCT'), ('ถูก', 'AUX'), ('ปลิง', 'VERB'), ('กัด', 'VERB'), ('แล้ว', 'ADV'), ('ไม่ได้', 'AUX'), ('ไป', 'VERB'), ('พบ', 'VERB'), ('แพทย์', 'NOUN'), ('<s/>', 'PUNCT')]

If you'd like to throw away the part of speech, just take the first element of each pair.

words = [word[0] for piece in pieces for word in piece]
print(words)
['สำนักงาน', 'เขต', 'จตุจักร', 'ชี้แจง', 'ว่า', '<s/>', 'ได้', 'นำ', 'ป้ายประกาศ', 'เตือน', 'ปลิง', 'ไป', 'ปัก', 'ตาม', 'แหล่งน้ำ', ' \n', 'ใน', 'เขต', 'อำเภอ', 'เมือง', '<s/>', 'จังหวัด', 'อ่างทอง', '<s/>', 'หลังจาก', 'นาย', 'สุ', 'กิจ', '<s/>', 'อายุ', '<s/>', '65 ', 'ปี', '<s/>', 'ถูก', 'ปลิง', 'กัด', 'แล้ว', 'ไม่ได้', 'ไป', 'พบ', 'แพทย์', '<s/>']
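You'll probably also want to drop the <s/> sentence markers and stray whitespace tokens that tltk leaves in. A minimal sketch:

words = [w.strip() for w in words if w != '<s/>' and w.strip() != '']
print(words)
['สำนักงาน', 'เขต', 'จตุจักร', 'ชี้แจง', 'ว่า', 'ได้', 'นำ', 'ป้ายประกาศ', 'เตือน', 'ปลิง', 'ไป', 'ปัก', 'ตาม', 'แหล่งน้ำ', 'ใน', 'เขต', 'อำเภอ', 'เมือง', 'จังหวัด', 'อ่างทอง', 'หลังจาก', 'นาย', 'สุ', 'กิจ', 'อายุ', '65', 'ปี', 'ถูก', 'ปลิง', 'กัด', 'แล้ว', 'ไม่ได้', 'ไป', 'พบ', 'แพทย์']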

Vietnamese: pyvi#

For Vietnamese we'll use pyvi. There are plenty of other options, but the best ones involve installing Java and separate packages, so we'll stick with pyvi to keep things simple.

#!pip install pyvi

Weirdly, when you run the tokenize method you get a single string back, with underscores holding together the multi-syllable words...

from pyvi import ViTokenizer, ViPosTagger

words = ViTokenizer.tokenize(u"Trường đại học bách khoa hà nội")
words
'Trường đại_học bách_khoa hà_nội'
words = words.split(" ")
words
['Trường', 'đại_học', 'bách_khoa', 'hà_nội']
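If you'd rather see those multi-syllable words with their original spaces instead of underscores, you can swap them back out. A minimal sketch:

[w.replace('_', ' ') for w in words]
['Trường', 'đại học', 'bách khoa', 'hà nội']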

But! If you're also hunting for parts of speech, you end up with a pair of lists: the words and their tags.

ViPosTagger.postagging(ViTokenizer.tokenize(u"Trường đại học Bách Khoa Hà Nội"))
(['Trường', 'đại_học', 'Bách_Khoa', 'Hà_Nội'], ['N', 'N', 'Np', 'Np'])

You can split the words and parts of speech apart easily enough, if you need them in separate variables.

words, pos = ViPosTagger.postagging(ViTokenizer.tokenize(u"Trường đại học Bách Khoa Hà Nội"))
print('words are', words)
print('pos are', pos)
words are ['Trường', 'đại_học', 'Bách_Khoa', 'Hà_Nội']
pos are ['N', 'N', 'Np', 'Np']

If you'd like them matched up (like in some of the examples above), you can use zip to pair the word and the part of speech.

list(zip(words, pos))
[('Trường', 'N'), ('đại_học', 'N'), ('Bách_Khoa', 'Np'), ('Hà_Nội', 'Np')]

Review#

In this section we looked at tokenizing text in several different languages that can't just be split on spaces. To learn how to use these libraries with scikit-learn vectorizers, check out the tutorial on how to make scikit-learn vectorizers work with Japanese, Chinese, and other East Asian languages.