Text analysis

In languages that don't use spaces to separate words, text analysis needs an extra step that isn't mentioned in most English-language-focused texts. We'll review how to segment words in Chinese, Japanese, Korean, Thai, and Vietnamese.
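
As a rough illustration of what "segmenting" means, here is a minimal sketch of greedy longest-match segmentation against a word list. The `segment` function, the tiny `vocabulary` set, and the `max_len` parameter are all made up for this example. Real segmenters rely on large dictionaries and statistical models, so treat this as a toy and not as the method the notebook uses.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy longest-match segmentation against a known word list (toy example)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the vocabulary matches.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical mini-vocabulary, just enough for the sample sentence.
vocabulary = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
print(segment("我喜欢自然语言处理", vocabulary))
```

Because the longest match wins, "自然语言" is kept as one word even though "自然" and "语言" are also in the vocabulary.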

A simple explanation of TF-IDF

Your name showing up once in a tweet is more important than it showing up once in a book, right? Well, maybe or maybe not! Let's examine term-frequency/inverse-document-frequency (TF-IDF) as a way of adjusting our text analysis.
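
To make the idea concrete, here is a minimal sketch of the textbook formula: term frequency (a word's share of its document) multiplied by the log of the inverse document frequency. The `tf_idf` helper and the three sample sentences are invented for illustration, and libraries such as scikit-learn use smoothed variants, so their numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document, using unsmoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # tf = share of the document; idf = log(total docs / docs containing the term)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up sample documents.
docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    print(doc, doc_scores)
```

"the" appears in every document, so its score drops to zero, while a word that appears in only one document scores highest.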

An explanation of TF-IDF with Chinese text

TF-IDF works just the same no matter what language you're working on. Here we'll take a look at using it with Chinese.

How to make scikit-learn vectorizers work with Japanese, Chinese, and other East Asian languages

Although we covered how to split words in East Asian languages, you need an extra step before they'll work with scikit-learn.

Stemming and lemmatization

Many words share common roots. Two techniques for combining them when doing text analysis are stemming and lemmatization.
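
The difference between the two can be sketched in a few lines. Stemming chops off suffixes by rule, while lemmatization looks words up to find their dictionary form. The `SUFFIXES` list, the `LEMMAS` table, and both functions below are deliberately naive stand-ins invented for this example. Real stemmers and lemmatizers use far richer rules and vocabularies.

```python
# Toy suffix list, checked in order (longer suffixes before the bare "s").
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Toy lookup table standing in for a real lemma dictionary.
LEMMAS = {"went": "go", "better": "good", "mice": "mouse"}


def naive_lemmatize(word):
    """Return the dictionary form if we know it, otherwise the word itself."""
    return LEMMAS.get(word, word)


for word in ["fishing", "fished", "fishes", "went"]:
    print(word, naive_stem(word), naive_lemmatize(word))
```

Rule-based stemming collapses "fishing", "fished", and "fishes" to "fish" but cannot do anything with "went", which only a lookup can map to "go".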

Intro to word embeddings

Learn how computers can begin to understand concepts and related words through "word embeddings."
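
The core idea is that each word becomes a list of numbers, and related words end up pointing in similar directions. Here is a minimal sketch using cosine similarity. The three-dimensional `toy_vectors` are hand-written for illustration only. Real embeddings are learned from large amounts of text and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors, NOT real embeddings.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

With these made-up numbers, "cat" and "dog" score much closer to 1.0 than "cat" and "car" do.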

Named entity recognition

Who or what is lurking in your documents? Named entity recognition can help!

Conceptual document similarity using word embeddings

While measuring document similarity by shared words is great, it can miss the concepts hiding inside the documents. Word embeddings are a way to link together words that aren't exact matches but are related to one another.
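
One simple way to do this, sketched below, is to average the vectors of the words in each document and compare the averages. Averaging is an assumption made for this example and is not necessarily the approach the notebook takes. The two-dimensional `toy_vectors` and the sample documents are invented.

```python
import math

# Hand-made toy vectors, NOT real embeddings.
toy_vectors = {
    "doctor": [0.9, 0.1], "physician": [0.85, 0.15],
    "hospital": [0.8, 0.2], "clinic": [0.75, 0.25],
    "guitar": [0.1, 0.9], "concert": [0.2, 0.8],
}


def document_vector(text, vectors):
    """Average the vectors of every known word in the text."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None  # nothing in the text has a vector
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


medical_a = document_vector("doctor hospital", toy_vectors)
medical_b = document_vector("physician clinic", toy_vectors)
music = document_vector("guitar concert", toy_vectors)

print(cosine_similarity(medical_a, medical_b))
print(cosine_similarity(medical_a, music))
```

The two medical documents share no words at all, yet they score as far more similar to each other than either does to the music document.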

Document similarity over different languages

Let's teach a computer how to read documents and match similar ones, even if they are in completely different languages.

Explaining n-grams in Natural Language Processing

Instead of just looking at words one at a time in your text analysis, sometimes it's more useful to look at 2- or 3-word phrases (or even more!). These are called n-grams, and they're easy to pick apart using scikit-learn's text processing tools.
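
To show what an n-gram actually is, here is a minimal pure-Python sketch that slides a window of `n` words across a sentence. The `ngrams` helper and the sample sentence are made up for this example. In scikit-learn the same idea is exposed through the `ngram_range` parameter of its text vectorizers.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))  # 2-word phrases (bigrams)
print(ngrams(tokens, 3))  # 3-word phrases (trigrams)
```

A sentence of four words yields three bigrams and two trigrams, and asking for more words than the sentence contains yields an empty list.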

Converting PDFs, Word docs, and HTML pages to text with Apache Tika

Converting a cache of various document formats to plain, machine-readable text can be difficult. Apache Tika to the rescue! Tika accepts a wide range of document formats and converts them to text for you. It can even OCR image-based PDFs!

Processing documents with Apache Tika in non-English languages

If you'd like to use Apache Tika to convert non-English documents to text, you'll need to make an adjustment or two. We'll dig into how to do that with examples from Greek.
