Breaking down a few different kinds of text analysis#

Natural language processing (NLP) is a wide field that encompasses everything involving language. We'll mostly stick to analyzing documents, but even then there are a hundred and one different things we can do. Let's break a few of them down.

(and yes, everything from books to tweets counts as a "document")

Word counting#

Sometimes you just want to count some words. We outline both a simple technique and a more advanced version.
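As a rough illustration of the simple end of word counting, here is a minimal sketch using only Python's standard library. The function name `count_words` and the sample sentence are made up for this example, and the regular expression is just one plausible way to split text into words.

```python
import re
from collections import Counter

def count_words(text):
    """Lowercase the text, pull out runs of letters, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical sample document
doc = "The cat sat on the mat. The mat was flat."
counts = count_words(doc)

# Show the two most frequent words and their counts
print(counts.most_common(2))  # [('the', 3), ('mat', 2)]
```

Notice that the most common word is "the", which is rarely interesting. Dealing with words like that is one of the things a more advanced approach has to handle.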

Topic modeling and clustering#

If you have no clue what a set of documents might be about, both topic modeling and clustering are approaches to getting a glance at what's inside. Topic modeling tries to find a set of topics that show up in the documents, while clustering organizes the documents into separate, discrete categories.

Entity extraction#

Sometimes you aren't looking for concepts; you're looking for actual people or things. Who is mentioned in that document dump? What companies are listed in a judge's conflict of interest filings? This is entity extraction.
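To give a feel for what "pulling names out of text" means, here is a deliberately naive sketch that treats any run of two or more capitalized words as a candidate entity. This is a stand-in heuristic, not real entity extraction: actual projects typically rely on a trained named-entity recognition model from an NLP library. The pattern, the function name `candidate_entities`, and the sample filing text are all invented for illustration.

```python
import re
from collections import Counter

# Two or more capitalized words in a row, e.g. "Acme Holdings".
# A crude pattern: it misses single-word names and has no idea
# whether a match is a person, a company, or neither.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")

def candidate_entities(text):
    """Count every capitalized multi-word phrase found in the text."""
    return Counter(CANDIDATE.findall(text))

# Hypothetical sample text
filing = ("Judge Maria Lopez reported stock in Acme Holdings. "
          "Her spouse also holds shares of Acme Holdings.")
entities = candidate_entities(filing)

for name, n in entities.most_common():
    print(n, name)
```

The title "Judge" gets swept into the person's name here, which is exactly the kind of mistake a proper entity extraction tool is built to avoid.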


Classification#

When you have a large set of documents, you can often organize them into two (or more) categories: ones you're interested in and ones you aren't.

You might be trying to find comments mentioning bullying, or disciplinary orders about sexual abuse, or complaints mentioning airbags that malfunctioned in a specific way. Classification can help out in these situations by letting you teach the computer what interesting and uninteresting documents look like. You read and label a portion, then let the computer sort through the rest!
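The "label a few, let the computer handle the rest" idea can be sketched with a tiny naive Bayes classifier built on word counts. This is only one of many possible classifiers, written from scratch with the standard library so it is easy to follow; the function names, the labels, and the four hand-labeled complaints are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(labeled_docs):
    """Count words per label from (text, label) pairs labeled by hand."""
    word_counts = {}        # label -> Counter of words seen under that label
    doc_counts = Counter()  # label -> number of documents with that label
    for text, label in labeled_docs:
        word_counts.setdefault(label, Counter()).update(tokenize(text))
        doc_counts[label] += 1
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one smoothing so an unseen word doesn't zero out the score
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical complaints that a person has already read and labeled
labeled = [
    ("the airbag did not deploy in the crash", "interesting"),
    ("airbag deployed without warning while driving", "interesting"),
    ("the radio stopped working last week", "uninteresting"),
    ("paint is peeling on the hood", "uninteresting"),
]
word_counts, doc_counts = train(labeled)

print(classify("my airbag failed to deploy", word_counts, doc_counts))
print(classify("the radio is broken", word_counts, doc_counts))
```

With only four training examples this is a toy. In practice you would label far more documents and check the classifier's guesses against ones you held back.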

We cover classification in a separate section, and you'll want to review how to count words first.

Sentiment analysis#

Positive or negative? Happy or sad? Sentiment analysis is the idea that you can extract emotional meaning from what people have written. Often used for news stories or tweets, it's generally a subset of classification.
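One very simple way to approximate sentiment is to count positive and negative words from a fixed list. Note that this is a word-list approach rather than a trained classifier, and the two tiny word lists below are invented for this sketch; real sentiment lexicons are far larger.

```python
import re

# Hypothetical, tiny word lists purely for illustration
POSITIVE = {"good", "great", "happy", "love", "wonderful"}
NEGATIVE = {"bad", "awful", "sad", "hate", "terrible"}

def sentiment_score(text):
    """Return (# positive words) - (# negative words); 0 means neutral."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

print(sentiment_score("I love this wonderful park"))  # 2
print(sentiment_score("What a terrible, sad day"))    # -2
print(sentiment_score("The meeting is at noon"))      # 0
```

A scorer like this has obvious blind spots: "not good" still counts as positive because it never looks at the words around a match.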

Document similarity#

Comparing two or more documents can be approached in a few different ways. Are you looking for word-for-word plagiarism, or just similarity in concepts? The former you can do with simple counts, while the latter takes a leap into word embeddings.
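For the counting side of document similarity, one common measure is cosine similarity between word-count vectors. The sketch below is a minimal standard-library version; the function names and sample sentences are made up, and it says nothing about similarity in concepts, which is where word embeddings come in.

```python
import math
import re
from collections import Counter

def word_counts(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(text_a, text_b):
    """Score from 0.0 (no shared words) to 1.0 (same word proportions)."""
    a, b = word_counts(text_a), word_counts(text_b)
    # Only words that appear in both documents contribute to the dot product
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical sample sentences
original = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the lazy dog"
unrelated = "stock prices fell sharply on monday"

print(round(cosine_similarity(original, copied), 3))
print(round(cosine_similarity(original, unrelated), 3))
```

Because it only looks at counts, this measure ignores word order: a shuffled copy of a document scores the same as an exact copy.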