Text analysis

A collection of simple notebooks and guides to cover concepts in isolation, without the context of a journalistic project. Useful for review or to prep for topics used in the actual projects.


Notebooks, Assignments, and Walkthroughs

Introduction to text analysis

A breakdown of different kinds of text analysis

Counting words with Python's Counter

Need to find the top words in a single document? Python's Counter is an easy-to-use tool that can work for most of your word-counting tasks.
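As a rough sketch of the idea (the sample sentence and the regex-based word splitting are assumptions for illustration, not taken from the notebook), counting the top words in one document might look like this:

```python
import re
from collections import Counter

# Hypothetical sample text, used only for illustration.
text = "The cat sat on the mat. The dog sat on the log."

# Lowercase the text, then treat runs of letters and apostrophes as words.
words = re.findall(r"[a-z']+", text.lower())

# Counter tallies how many times each word appears.
counts = Counter(words)

# most_common(n) returns the n highest counts as (word, count) pairs.
top_words = counts.most_common(3)
print(top_words)
```

A regex split like this is crude (it ignores numbers and splits on hyphens), but it is usually enough for a quick look at a single document.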

Counting words with scikit-learn's CountVectorizer

If you need to analyze or compare a set of documents, leveling up to scikit-learn for your text analysis needs is an excellent idea.

Word-splitting in East Asian languages

In languages that don't use spaces to separate words, text analysis needs an extra step that isn't mentioned in most English-language-focused texts. We'll review how to segment words in Chinese, Japanese, Korean, Thai, and Vietnamese.

A simple explanation of TF-IDF

Your name showing up once in a tweet is more important than it showing up once in a book, right? Well, maybe or maybe not! Let's examine term-frequency/inverse-document-frequency (TF-IDF) as a way of adjusting our text analysis.
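To make the tweet-versus-book intuition concrete, here is a small sketch using the textbook formula, term frequency times log(N / document frequency). The three documents and the helper names `tf`, `idf`, and `tf_idf` are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant by default, so its numbers will differ from these.

```python
import math

# Hypothetical corpus: a short tweet, a long "book", and a memo.
docs = {
    "tweet": "smith wins".split(),
    "book": ("smith " + "the story goes on " * 50).split(),
    "memo": "the budget goes up".split(),
}

def tf(term, tokens):
    # Term frequency: the share of the document's words that are this term.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Inverse document frequency: rarer across documents means a higher weight.
    # Assumes the term appears in at least one document.
    containing = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, name, corpus):
    return tf(term, corpus[name]) * idf(term, corpus)

tweet_score = tf_idf("smith", "tweet", docs)
book_score = tf_idf("smith", "book", docs)
print(tweet_score, book_score)
```

The name appears once in each document, but it makes up half of the tweet and only a sliver of the book, so the tweet's score is far higher.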

An explanation of TF-IDF with Chinese text

TF-IDF works just the same no matter what language you're working in. Here we'll take a look at using it with Chinese.

How to make scikit-learn vectorizers work with Japanese, Chinese, and other East Asian languages

Although we covered how to split words in East Asian languages, you need an extra step before the segmented text will work with scikit-learn's vectorizers.

Stemming and lemmatization

Many words share common roots. Two techniques for combining them when doing text analysis are stemming and lemmatization.

Intro to word embeddings

Learn how computers can begin to understand concepts and related words through "word embeddings."

Named entity recognition

Who or what is lurking in your documents? Named entity recognition can help!

Conceptual document similarity using word embeddings

Measuring document similarity by shared words is great, but it can miss the concepts hiding inside the documents. Word embeddings are a way to link together words that aren't exact matches but are related to one another.

Document similarity over different languages

Let's teach a computer how to read documents and match similar ones, even if they are in completely different languages.

Explaining n-grams in Natural Language Processing

Instead of just looking at words one at a time in your text analysis, sometimes it's more useful to look at 2- or 3-word phrases (or even longer ones!). These are called n-grams, and they're easy to pick apart using scikit-learn's text processing tools.
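To show what an n-gram is, here is a plain-Python sketch (the `ngrams` helper and the sample sentence are hypothetical, and this is not the scikit-learn approach itself). In scikit-learn, the `ngram_range` parameter of `CountVectorizer` produces the same kind of phrases.

```python
def ngrams(tokens, n):
    # Slide a window of n words across the token list,
    # joining each window into a single phrase.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, split on whitespace.
tokens = "the city council approved the budget".split()

bigrams = ngrams(tokens, 2)   # 2-word phrases
trigrams = ngrams(tokens, 3)  # 3-word phrases

print(bigrams)
print(trigrams)
```

A phrase like "city council" carries meaning that the separate words "city" and "council" do not, which is why counting n-grams can surface things single-word counts miss.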

Converting PDFs, Word docs, and HTML pages to text with Apache Tika

Converting a cache of various document formats to plain, machine-readable text can be difficult. Apache Tika to the rescue! Tika will take *any* kind of document and convert it right on into text for you. It even does OCR of image-based PDFs!

Processing documents with Apache Tika in non-English languages

If you'd like to use Apache Tika to convert documents to text with non-English languages, there's a slight adjustment or two that needs to be made. We'll dig into how to do that with examples from Greek.

Introduction to topic modeling

While you'll usually train a computer to look for what you're interested in, you're also free to let it loose to read on its own.

Choosing the right number of topics for scikit-learn topic modeling

Topic modeling requires one input: the number of topics you'd like to find. How can you make sure you're picking the right one?

Topic modeling with Gensim

Gensim is a popular library for topic modeling. Here we'll see how it stacks up to scikit-learn.

Topic modeling and clustering

Topic models and clustering are both techniques for automatically learning about documents. How do they compare?