Word-splitting and text segmentation in East Asian languages#
As different as they are, Chinese, Japanese and Korean are lumped together as CJK languages when discussed from an English-language point of view. One reason they're considered similar is that they don't use spaces the way English does. While analyzing English mostly means splitting sentences on spaces, the difficulty with this set of languages (and others) is figuring out where one word ends and the next begins. In this section we'll review how to use libraries to segment words in these languages, as well as in Thai and Vietnamese.
Using libraries#
Segmenting is the process of splitting a text into separate words. There isn't always a "right" answer as to what the split should be, so you might have to try a few different libraries before you find one that feels like a good fit. The recommendations below aren't necessarily the best Python packages, they're just ones that had a bit of activity, seemingly-decent interfaces and documentation, and no external dependencies to install.
Using these libraries with scikit-learn#
After reading this page, you might want to learn how to use these libraries with scikit-learn vectorizers. In that case, check out the tutorial on how to make scikit-learn vectorizers work with Japanese, Chinese, and other East Asian languages.
Chinese: jieba#
The Chinese word segmentation library jieba is very popular when analyzing Chinese text.
#!pip install jieba
Jieba has a few different techniques we can use to segment words. We can read the documentation to get into the details, but one major question is whether we want the smallest possible divisions. Using lcut gives us individual words and what might be considered noun phrases.
import jieba
jieba.lcut('我来到北京清华大学')
If we want to divide things up a bit more, we can add cut_all=True.
jieba.lcut('我来到北京清华大学', cut_all=True)
The big difference is that cut_all will split something like 清华大学 into both 清华 and 华大.
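To see the difference at a glance, we can line the two versions up next to each other. Here's a quick check using the same example sentence, along with the pieces that only show up when cut_all is turned on.
sentence = '我来到北京清华大学'
# Default mode versus "cut everything" mode
print('default:', jieba.lcut(sentence))
print('cut_all:', jieba.lcut(sentence, cut_all=True))
# Pieces that only appear with cut_all=True
print(set(jieba.lcut(sentence, cut_all=True)) - set(jieba.lcut(sentence)))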
When might we use one compared to the other? The default segmentation usually works best when you want each word counted once, while the extra pieces from cut_all can help if you're trying to match every possible form a word might take (when searching, for example).
Japanese: nagisa#
For Japanese we'll use nagisa, a library that handles both word segmentation and part-of-speech tagging.
#!pip install nagisa
import nagisa
text = 'Pythonで簡単に使えるツールです'
doc = nagisa.tagging(text)
doc.words
In addition to simple tokenization, nagisa will also do part-of-speech tagging.
doc.postags
This allows you to do things like pluck out all of the nouns (名詞 means "noun").
nouns = nagisa.extract(text, extract_postags=['名詞'])
nouns.words
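If you'd like each word lined up with its tag, you can zip the two lists together. This is plain Python, nothing nagisa-specific.
# Pair each word with its part-of-speech tag
list(zip(doc.words, doc.postags))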
Korean: KoNLPy#
Korean does use spaces, but particles and endings attach themselves to the "actual" words, so you can't just split on spaces. KoNLPy has several engines that will help you with this. The KoNLPy documentation includes a comparison chart of the different engines, along with more specific details about each one.
#!pip install konlpy
phrase = "아버지가방에들어가신다"
from konlpy.tag import Hannanum
hannanum = Hannanum()
hannanum.morphs(phrase)
from konlpy.tag import Kkma
kkma = Kkma()
kkma.morphs(phrase)
from konlpy.tag import Komoran
komoran = Komoran()
komoran.morphs(phrase)
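The engines can do more than break text into morphemes: each one also has a pos method that pairs each morpheme with a part-of-speech tag, and a nouns method that keeps only the nouns. Here's what that looks like with Hannanum.
# Morphemes paired with their part-of-speech tags
hannanum.pos(phrase)
# Just the nouns
hannanum.nouns(phrase)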
Thai: tltk#
You can find many Thai NLP packages here, but we'll focus on tltk. It doesn't have the best documentation and it might not be the most accurate, but it doesn't require us to install anything extra (e.g. TensorFlow), and that's absolutely the only reason we're using it.
#!pip install tltk
import tltk
phrase = """สำนักงานเขตจตุจักรชี้แจงว่า ได้นำป้ายประกาศเตือนปลิงไปปักตามแหล่งน้ำ
ในเขตอำเภอเมือง จังหวัดอ่างทอง หลังจากนายสุกิจ อายุ 65 ปี ถูกปลิงกัดแล้วไม่ได้ไปพบแพทย์"""
pieces = tltk.nlp.pos_tag(phrase)
pieces
The result comes back as a list of lists, where each piece is a list of (word, part-of-speech) pairs. If you just want everything in one flat list, you'll need to jump through a tiny hoop.
# Flatten the pieces into one list of (word, part-of-speech) pairs
words = [word for piece in pieces for word in piece]
print(words)
If you'd like to cast away the part of speech, just ask for the first part of the pair.
words = [word[0] for piece in pieces for word in piece]
print(words)
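If you're heading toward counting words, the same flattening trick works nicely with collections.Counter. This is just a sketch that tallies the words in the pieces output from above, ignoring the part-of-speech tags.
from collections import Counter
# Count each word, throwing away the part-of-speech tags
Counter(word for piece in pieces for word, pos in piece).most_common(5)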
Vietnamese: pyvi#
For Vietnamese we'll use pyvi. There are plenty of other options, but the best ones all involve installing Java and separate packages, so we'll stick with pyvi to keep things simple.
#!pip install pyvi
Weirdly, when you run the tokenize method you get a string back...
from pyvi import ViTokenizer, ViPosTagger
words = ViTokenizer.tokenize(u"Trường đại học bách khoa hà nội")
words
words = words.split(" ")
words
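One thing to notice: pyvi marks multi-syllable words by gluing their syllables together with underscores (đại_học, for example). If you'd rather have normal spaces inside each word, you can swap them back out.
# Turn đại_học back into đại học
[word.replace('_', ' ') for word in words]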
But! If you're also hunting for parts of speech, you end up with a list.
ViPosTagger.postagging(ViTokenizer.tokenize(u"Trường đại học Bách Khoa Hà Nội"))
You can split the words and parts of speech apart easily enough, if you need them in separate variables.
words, pos = ViPosTagger.postagging(ViTokenizer.tokenize(u"Trường đại học Bách Khoa Hà Nội"))
print('words are', words)
print('pos are', pos)
If you'd like them matched up (like in some of the examples above), you can use zip to pair each word with its part of speech.
list(zip(words, pos))
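If you'd like to browse or filter the results, you can also drop the paired-up output into a pandas dataframe (assuming you have pandas installed, it isn't needed for anything else on this page).
import pandas as pd
# One row per word, with its part-of-speech tag alongside
pd.DataFrame({'word': words, 'pos': pos})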
Review#
In this section we looked at tokenizing text in several different languages that can't just be split on spaces. To learn how to use these libraries with scikit-learn vectorizers, check out the tutorial on how to make scikit-learn vectorizers work with Japanese, Chinese, and other East Asian languages.