A collection of simple notebooks and guides to cover concepts in isolation, without the context of a journalistic project. Useful for review or to prep for topics used in the actual projects.
Readings and links
Notebooks, Assignments, and Walkthroughs
A breakdown of different kinds of text analysis
Need to find the top words in a single document? Python's Counter is an easy-to-use tool that can work for most of your word-counting tasks.
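As a rough illustration of the idea, here is a minimal sketch using `collections.Counter` from the standard library. The sample sentence and the simple regex tokenizer are assumptions made for the example, not part of the notebook itself.

```python
import re
from collections import Counter

# Hypothetical sample text, used only for illustration
text = "The council met on Monday. The council voted, and the vote was close."

# Lowercase the text and pull out runs of letters (a deliberately simple tokenizer)
words = re.findall(r"[a-z']+", text.lower())

# Counter tallies how many times each word appears
word_counts = Counter(words)

# most_common(n) returns the n most frequent (word, count) pairs
top_words = word_counts.most_common(2)
print(top_words)
```

Common words such as "the" tend to dominate raw counts, which is one reason the later guides move on to other weighting schemes.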
If you need to analyze or compare a set of documents, leveling up to scikit-learn for your text analysis needs is an excellent idea.
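A small sketch of what that can look like, assuming scikit-learn is installed. It uses `CountVectorizer` with its default settings, and the three sample sentences are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, used only for illustration
docs = [
    "The mayor spoke about the budget.",
    "The budget vote is on Tuesday.",
    "Tuesday's weather looks sunny.",
]

# Build one row per document and one column per word
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)
counts = matrix.toarray()

# vocabulary_ maps each word to its column index
vocab = vectorizer.vocabulary_
budget_col = vocab["budget"]

# How often "budget" appears in each document
budget_counts = [int(row[budget_col]) for row in counts]
print(budget_counts)
```

Because every document shares the same columns, the rows can be compared directly, which is harder to arrange with a separate `Counter` per document.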
In languages that don't use spaces to separate words, text analysis needs an extra step that isn't mentioned in most English-language-focused texts. We'll review how to segment words in Chinese, Japanese, Korean, Thai, and Vietnamese.
Your name showing up once in a tweet is more important than it showing up once in a book, right? Well, maybe or maybe not! Let's examine term-frequency/inverse-document-frequency (TF-IDF) as a way of adjusting our text analysis.
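To make the tweet-versus-book intuition concrete, here is a hand-rolled sketch of one common textbook form of TF-IDF: term frequency as a share of the document's words, multiplied by the natural log of (number of documents / documents containing the term). The documents and the helper names `tf`, `idf`, and `tf_idf` are hypothetical. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical documents, used only for illustration
docs = {
    "tweet": "ada wrote this tweet",
    "book": "ada wrote this very long book about a very long trip",
    "memo": "this memo is about the budget",
}

tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

def tf(term, tokens):
    # Share of the document's words that are this term
    return Counter(tokens)[term] / len(tokens)

def idf(term):
    # Terms found in fewer documents get a larger weight.
    # Assumes the term appears in at least one document.
    doc_freq = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log(n_docs / doc_freq)

def tf_idf(term, name):
    return tf(term, tokenized[name]) * idf(term)

tweet_score = tf_idf("ada", "tweet")
book_score = tf_idf("ada", "book")
common_score = tf_idf("this", "tweet")
print(tweet_score, book_score, common_score)
```

In this sketch the name scores higher in the short tweet than in the longer book, while "this", which appears in every document, scores zero.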
TF-IDF works just the same no matter what language you're working on. Here we'll take a look at using it with Chinese.
Although we covered how to split words in East Asian languages, you need an extra step before they'll work with scikit-learn.
Many words share common roots. Two techniques for combining them when doing text analysis are stemming and lemmatization.
Learn how computers can begin to understand concepts and related words through "word embeddings."
Who or what is lurking in your documents? Named entity recognition can help!
While asking for document similarity based on shared words is great, it might be missing out on the concepts hiding inside the documents. Word embeddings are a way to link together words that aren't exact matches but are related to one another.
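The limitation of shared-word similarity can be sketched with plain word counts and cosine similarity. This is only the baseline that embeddings improve on, not an embedding model, and the function name `cosine_similarity` and the sample phrases are assumptions made for the example.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Represent each text as a bag of word counts
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Identical wording scores (approximately) 1
same = cosine_similarity("the dog barked", "the dog barked")

# Related meaning but no shared words scores 0
related = cosine_similarity("physician treats patients", "doctor heals people")
print(same, related)
```

The second pair is about the same thing, yet a shared-word measure sees nothing in common. That gap is what word embeddings are meant to close.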
Let's teach a computer how to read documents and match similar ones, even if they are in completely different languages.
Instead of just looking at words one at a time in your text analysis, sometimes it's more useful to look at 2- or 3-word phrases (or even more!). These are called n-grams, and they're easy to pick apart using scikit-learn's text processing tools.
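To show what an n-gram is before reaching for a library, here is a plain-Python sketch. The helper `ngrams` and the sample phrase are hypothetical. In scikit-learn, the `ngram_range` parameter of `CountVectorizer` covers the same ground.

```python
def ngrams(tokens, n):
    # Slide a window of n words across the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample phrase, split on whitespace
tokens = "the city council voted".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A phrase like "city council" survives as a single unit here, where one-word-at-a-time counting would split it into two unrelated words.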
Converting a cache of various document formats to plain, machine-readable text can be difficult. Apache Tika to the rescue! Tika will take *any* kind of document and convert it right into text for you. It even does OCR of image-based PDFs!
If you'd like to use Apache Tika to convert documents to text with non-English languages, there's a slight adjustment or two that needs to be made. We'll dig into how to do that with examples from Greek.
While you'll usually train a computer to look for what you're interested in, you're also free to let it loose to read on its own.
Topic modeling requires one input: the number of topics you'd like to find. How can you make sure you're picking the right one?
Gensim is a popular library for topic modeling. Here we'll see how it stacks up to scikit-learn.
Topic models and clustering are both techniques for automatically learning about documents. How do they compare?