{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Counting words in Python with sklearn's CountVectorizer\n", "\n", "There are several ways to count words in Python: the easiest is probably to use a [Counter](https://pymotw.com/3/collections/counter.html)! We'll be covering another technique here, the CountVectorizer from [scikit-learn](https://scikit-learn.org/).\n", "\n", "CountVectorizer is a little more intense than using Counter, but don't let that frighten you off! If your project is more complicated than \"count the words in this book,\" the CountVectorizer might actually be easier in the long run." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<p class=\"reading-options\">\n <a class=\"btn\" href=\"/text-analysis/counting-words-with-scikit-learns-countvectorizer\">\n <i class=\"fa fa-sm fa-book\"></i>\n Read online\n </a>\n <a class=\"btn\" href=\"/text-analysis/notebooks/Counting words with scikit-learn's CountVectorizer.ipynb\">\n <i class=\"fa fa-sm fa-download\"></i>\n Download notebook\n </a>\n <a class=\"btn\" href=\"https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/text-analysis/notebooks/Counting words with scikit-learn's CountVectorizer.ipynb\" target=\"_new\">\n <i class=\"fa fa-sm fa-laptop\"></i>\n Interactive version\n </a>\n</p>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using CountVectorizer\n", "\n", "While Counter is used for counting all sorts of things, the CountVectorizer is specifically used for **counting words.** The **vectorizer** part of CountVectorizer is (technically speaking!) the process of converting text into some sort of number-y thing that computers can understand.\n", "\n", "Unfortunately, the \"number-y thing that computers can understand\" is kind of hard for us to understand. See below:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<1x20 sparse matrix of type '<class 'numpy.int64'>'\n", "\twith 20 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# Build our text\n", "text = \"\"\"Yesterday I went fishing. I don't fish that often, \n", "so I didn't catch any fish. I was told I'd enjoy myself, \n", "but it didn't really seem that fun.\"\"\"\n", "\n", "vectorizer = CountVectorizer()\n", "\n", "matrix = vectorizer.fit_transform([text])\n", "matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to do a little magic to turn the results into a format we can understand." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>any</th>\n", " <th>but</th>\n", " <th>catch</th>\n", " <th>didn</th>\n", " <th>don</th>\n", " <th>enjoy</th>\n", " <th>fish</th>\n", " <th>fishing</th>\n", " <th>fun</th>\n", " <th>it</th>\n", " <th>myself</th>\n", " <th>often</th>\n", " <th>really</th>\n", " <th>seem</th>\n", " <th>so</th>\n", " <th>that</th>\n", " <th>told</th>\n", " <th>was</th>\n", " <th>went</th>\n", " <th>yesterday</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " any but catch didn don enjoy fish fishing fun it myself often \\\n", "0 1 1 1 2 1 1 2 1 1 1 1 1 \n", "\n", " really seem so that told was went yesterday \n", "0 1 1 1 2 1 1 1 1 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "counts = pd.DataFrame(matrix.toarray(),\n", " columns=vectorizer.get_feature_names())\n", "counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Understanding CountVectorizer\n", "\n", "Let's break it down line by line.\n", "\n", "### Creating and using a vectorizer\n", "\n", "First, we made a new CountVectorizer. This is the thing that's going to understand and count the words for us. It has a _lot_ of different options, but we'll just use the normal, standard version for now." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "vectorizer = CountVectorizer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we told the vectorizer to read the text for us." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<1x20 sparse matrix of type '<class 'numpy.int64'>'\n", "\twith 20 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matrix = vectorizer.fit_transform([text])\n", "matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notice that we gave it `[text]` instead of just `text`.** This is because sklearn is typically meant for the world of MACHINE LEARNING, where you're probably reading a lot of documents at once. Sklearn doesn't even want to deal with texts one at a time, so **we have to send it a list**.\n", "\n", "When we did `.fit_transform()`, this did two things:\n", "\n", "1. Found all of the different words in the text\n", "2. Counted how many of each there were\n", "\n", "The `matrix` variable it sent back is a big ugly thing just for computers. If we want to look at it, though, we can!" 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1]])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matrix.toarray()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each of those numbers is how many times a word showed up - most words showed up one time, and some showed up twice. But how do we know which word is which?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['any', 'but', 'catch', 'didn', 'don', 'enjoy', 'fish', 'fishing', 'fun', 'it', 'myself', 'often', 'really', 'seem', 'so', 'that', 'told', 'was', 'went', 'yesterday']\n" ] } ], "source": [ "print(vectorizer.get_feature_names())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The order of the words matches the order of the numbers! First in the words list is `any`, and first in the numbers list is `1`. That means \"any\" showed up once. In the same way you can figure out that `fish` is the seventh word in the list, which (count to the seventh number) showed up `2` times.\n", "\n", "### Converting the output\n", "\n", "Reading the `matrix` output gets easier if we move it into a pandas dataframe." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>any</th>\n", " <th>but</th>\n", " <th>catch</th>\n", " <th>didn</th>\n", " <th>don</th>\n", " <th>enjoy</th>\n", " <th>fish</th>\n", " <th>fishing</th>\n", " <th>fun</th>\n", " <th>it</th>\n", " <th>myself</th>\n", " <th>often</th>\n", " <th>really</th>\n", " <th>seem</th>\n", " <th>so</th>\n", " <th>that</th>\n", " <th>told</th>\n", " <th>was</th>\n", " <th>went</th>\n", " <th>yesterday</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " any but catch didn don enjoy fish fishing fun it myself often \\\n", "0 1 1 1 2 1 1 2 1 1 1 1 1 \n", "\n", " really seem so that told was went yesterday \n", "0 1 1 1 2 1 1 1 1 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts = pd.DataFrame(matrix.toarray(),\n", " columns=vectorizer.get_feature_names())\n", "counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to see a sorted list similar to what Counter gave us, though, we need to do a little shifting around." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>didn</th>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>fish</th>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>that</th>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>any</th>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>often</th>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>went</th>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>was</th>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>told</th>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>so</th>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>seem</th>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0\n", "didn 2\n", "fish 2\n", "that 2\n", "any 1\n", "often 1\n", "went 1\n", "was 1\n", "told 1\n", "so 1\n", "seem 1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts.T.sort_values(by=0, ascending=False).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's **something a little weird about this.** `didn` isn't a word - it should be `didn't`, right? And `i` isn't in our list, even though the first sentence is \"I went fishing yesterday.\" The reasons why:\n", "\n", "* By default, the CountVectorizer splits words on punctuation, so `didn't` becomes two words - `didn` and `t`. Their argument is that it's [actually \"did not\"](https://github.com/nltk/nltk/issues/401) and shouldn't be kept together. You can read more about this [right here](http://www.nltk.org/book/ch03.html#sec-tokenization).\n", "* By default, the CountVectorizer also **only uses words that are 2 or more letters.** So `i` doesn't make the cute, nor does the `t` up above.\n", "\n", "### Customizing CountVectorizer\n", "\n", "We don't have a good solution to the first one, but we can customize CountVectorizer to include 1-character words." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>any</th>\n", " <th>but</th>\n", " <th>catch</th>\n", " <th>d</th>\n", " <th>didn</th>\n", " <th>don</th>\n", " <th>enjoy</th>\n", " <th>fish</th>\n", " <th>fishing</th>\n", " <th>fun</th>\n", " <th>...</th>\n", " <th>often</th>\n", " <th>really</th>\n", " <th>seem</th>\n", " <th>so</th>\n", " <th>t</th>\n", " <th>that</th>\n", " <th>told</th>\n", " <th>was</th>\n", " <th>went</th>\n", " <th>yesterday</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>...</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>1 rows \u00d7 23 columns</p>\n", "</div>" ], "text/plain": [ " any but catch d didn don enjoy fish fishing fun ... often \\\n", "0 1 1 1 1 2 1 1 2 1 1 ... 1 \n", "\n", " really seem so t that told was went yesterday \n", "0 1 1 1 3 2 1 1 1 1 \n", "\n", "[1 rows x 23 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer = CountVectorizer(token_pattern=r\"(?u)\\b\\w+\\b\")\n", "\n", "matrix = vectorizer.fit_transform([text])\n", "counts = pd.DataFrame(matrix.toarray(),\n", " columns=vectorizer.get_feature_names())\n", "\n", "counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This ability to customize `CountVectorizer` means for even intermediate text analysis it's usually more useful than `Counter`. \n", "\n", "This was a boring example that makes CountVectorizer seem like trouble, but it has [a lot of other options](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) we aren't dealing with, too." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CountVectorizer in practice\n", "\n", "### Counting words in a book\n", "\n", "Now that we know the basics of how to clean text and do text analysis with `CountVectorizer`, let's try it with an actual book! We'll use Jane Austen's [Pride and Prejudice](http://www.gutenberg.org/cache/epub/42671/pg42671.txt)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "d to be any thing extraordinary now. When a woman has\r\n", "five grown up daughters, she ought to give over thinking of her own\r\n", "beauty.\"\r\n", "\r\n", "\"In such cases, a woman has not often much beauty to think of.\"\r\n", "\r\n", "\"But, my dear, you must indeed go and see Mr. Bingley when he comes into\r\n", "the neighbourhood.\"\r\n", "\r\n", "\"It is more than I engage for, I assure you.\"\r\n", "\r\n", "\"But consider your daughters. Only think what an establishment it would\r\n", "be for one of them. 
Sir William and Lady Lucas are determined to go,\r\n", "merely o\n" ] } ], "source": [ "import requests\n", "\n", "# Download the book\n", "response = requests.get('http://www.gutenberg.org/cache/epub/42671/pg42671.txt')\n", "text = response.text\n", "\n", "# Look at some text in the middle\n", "print(text[4100:4600])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To count the words in the book, we're going to use the **same code we used before**. Since we have new content in `text`, we can 100% cut-and-paste." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>the</th>\n", " <td>4520</td>\n", " </tr>\n", " <tr>\n", " <th>to</th>\n", " <td>4242</td>\n", " </tr>\n", " <tr>\n", " <th>of</th>\n", " <td>3749</td>\n", " </tr>\n", " <tr>\n", " <th>and</th>\n", " <td>3662</td>\n", " </tr>\n", " <tr>\n", " <th>her</th>\n", " <td>2205</td>\n", " </tr>\n", " <tr>\n", " <th>in</th>\n", " <td>1941</td>\n", " </tr>\n", " <tr>\n", " <th>was</th>\n", " <td>1846</td>\n", " </tr>\n", " <tr>\n", " <th>she</th>\n", " <td>1689</td>\n", " </tr>\n", " <tr>\n", " <th>that</th>\n", " <td>1566</td>\n", " </tr>\n", " <tr>\n", " <th>it</th>\n", " <td>1549</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0\n", "the 4520\n", "to 4242\n", "of 3749\n", "and 3662\n", "her 2205\n", "in 1941\n", "was 1846\n", "she 1689\n", "that 1566\n", "it 1549" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer = CountVectorizer()\n", "\n", "matrix = vectorizer.fit_transform([text])\n", "counts = pd.DataFrame(matrix.toarray(),\n", " columns=vectorizer.get_feature_names())\n", "\n", "# Show us the top 10 most common words\n", "counts.T.sort_values(by=0, ascending=False).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How often is **love** used?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 92\n", "Name: love, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts['love']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about **hate**?" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 9\n", "Name: hate, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts['hate']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Counting words in multiple books\n", "\n", "Remember how I said CountVectorizer is better at multiple pieces of text? Let's use that ability! We'll use a few:\n", "\n", "* [Pride and Prejudice](http://www.gutenberg.org/cache/epub/42671/pg42671.txt)\n", "* [Frankenstein](https://www.gutenberg.org/files/84/84-0.txt)\n", "* [Dr. Jekyll and Mr. 
Hyde](https://www.gutenberg.org/files/43/43-0.txt)\n", "* [Great Expectations](https://www.gutenberg.org/files/1400/1400-0.txt)\n", "\n", "We'll create a dataframe out of the name and URL, then grab the contents of the books from the URL." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>name</th>\n", " <th>url</th>\n", " <th>content</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Pride and Prejudice</td>\n", " <td>http://www.gutenberg.org/cache/epub/42671/pg42...</td>\n", " <td>\ufeffThe Project Gutenberg eBook, Pride and Prejud...</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Frankenstein</td>\n", " <td>https://www.gutenberg.org/files/84/84-0.txt</td>\n", " <td>\u00ef\u00bb\u00bf\\r\\nProject Gutenberg's Frankenstein, by Ma...</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Dr. Jekyll and Mr. Hyde</td>\n", " <td>https://www.gutenberg.org/files/43/43-0.txt</td>\n", " <td>\\r\\nThe Project Gutenberg EBook of The Strange...</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Great Expectations</td>\n", " <td>https://www.gutenberg.org/files/1400/1400-0.txt</td>\n", " <td>\u00ef\u00bb\u00bfThe Project Gutenberg EBook of Great Expect...</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " name url \\\n", "0 Pride and Prejudice http://www.gutenberg.org/cache/epub/42671/pg42... \n", "1 Frankenstein https://www.gutenberg.org/files/84/84-0.txt \n", "2 Dr. Jekyll and Mr. Hyde https://www.gutenberg.org/files/43/43-0.txt \n", "3 Great Expectations https://www.gutenberg.org/files/1400/1400-0.txt \n", "\n", " content \n", "0 \ufeffThe Project Gutenberg eBook, Pride and Prejud... \n", "1 \u00ef\u00bb\u00bf\\r\\nProject Gutenberg's Frankenstein, by Ma... \n", "2 \\r\\nThe Project Gutenberg EBook of The Strange... \n", "3 \u00ef\u00bb\u00bfThe Project Gutenberg EBook of Great Expect... " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Build our dataframe\n", "df = pd.DataFrame([\n", " { 'name': 'Pride and Prejudice', 'url': 'http://www.gutenberg.org/cache/epub/42671/pg42671.txt' },\n", " { 'name': 'Frankenstein', 'url': 'https://www.gutenberg.org/files/84/84-0.txt' },\n", " { 'name': 'Dr. Jekyll and Mr. Hyde', 'url': 'https://www.gutenberg.org/files/43/43-0.txt' },\n", " { 'name': 'Great Expectations', 'url': 'https://www.gutenberg.org/files/1400/1400-0.txt' },\n", "])\n", "\n", "# Download the contents of the book, put it in the 'content' column\n", "df['content'] = df.url.apply(lambda url: requests.get(url).text)\n", "\n", "# How'd it turn out?\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we just feed it to the CountVectorizer, and we get a nice organized dataframe of the words counted in each book!" 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>000</th>\n", " <th>10</th>\n", " <th>10_th</th>\n", " <th>11</th>\n", " <th>11th</th>\n", " <th>12</th>\n", " <th>12th</th>\n", " <th>13</th>\n", " <th>13th</th>\n", " <th>14</th>\n", " <th>...</th>\n", " <th>yourselves</th>\n", " <th>youth</th>\n", " <th>youthful</th>\n", " <th>youthfulness</th>\n", " <th>youths</th>\n", " <th>you\u00e2</th>\n", " <th>zeal</th>\n", " <th>zealous</th>\n", " <th>zest</th>\n", " <th>zip</th>\n", " </tr>\n", " <tr>\n", " <th>name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Pride and Prejudice</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>2</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>Frankenstein</th>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>2</td>\n", " <td>2</td>\n", " <td>2</td>\n", " <td>3</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>...</td>\n", " <td>1</td>\n", " <td>21</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>4</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>Dr. Jekyll and Mr. Hyde</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>Great Expectations</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>2</td>\n", " <td>9</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>4 rows \u00d7 16183 columns</p>\n", "</div>" ], "text/plain": [ " 000 10 10_th 11 11th 12 12th 13 13th 14 \\\n", "name \n", "Pride and Prejudice 1 0 0 0 0 0 0 0 0 0 \n", "Frankenstein 1 2 0 2 2 2 2 3 1 2 \n", "Dr. Jekyll and Mr. Hyde 1 0 1 0 0 0 1 0 0 0 \n", "Great Expectations 1 0 0 0 0 0 0 0 0 0 \n", "\n", " ... yourselves youth youthful youthfulness \\\n", "name ... \n", "Pride and Prejudice ... 2 9 0 0 \n", "Frankenstein ... 1 21 3 0 \n", "Dr. Jekyll and Mr. Hyde ... 
0 2 0 0 \n", "Great Expectations ... 2 9 2 1 \n", "\n", " youths you\u00e2 zeal zealous zest zip \n", "name \n", "Pride and Prejudice 1 0 0 0 0 3 \n", "Frankenstein 0 1 4 0 0 1 \n", "Dr. Jekyll and Mr. Hyde 0 1 0 0 0 1 \n", "Great Expectations 0 0 2 2 1 1 \n", "\n", "[4 rows x 16183 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer = CountVectorizer()\n", "\n", "# Use the content column instead of our single text variable\n", "matrix = vectorizer.fit_transform(df.content)\n", "counts = pd.DataFrame(matrix.toarray(),\n", " index=df.name,\n", " columns=vectorizer.get_feature_names())\n", "\n", "counts.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can even use it to select a few interesting words out of each!" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>love</th>\n", " <th>hate</th>\n", " <th>murder</th>\n", " <th>terror</th>\n", " <th>cried</th>\n", " <th>food</th>\n", " <th>dead</th>\n", " <th>sister</th>\n", " <th>husband</th>\n", " <th>wife</th>\n", " </tr>\n", " <tr>\n", " <th>name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Pride and Prejudice</th>\n", " <td>92</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>91</td>\n", " <td>0</td>\n", " <td>5</td>\n", " <td>217</td>\n", " <td>50</td>\n", " <td>47</td>\n", " </tr>\n", " <tr>\n", " <th>Frankenstein</th>\n", " <td>59</td>\n", " <td>9</td>\n", " <td>21</td>\n", " <td>10</td>\n", " <td>15</td>\n", " <td>27</td>\n", " <td>23</td>\n", " <td>26</td>\n", " <td>2</td>\n", " <td>11</td>\n", " </tr>\n", " <tr>\n", " <th>Dr. Jekyll and Mr. Hyde</th>\n", " <td>3</td>\n", " <td>1</td>\n", " <td>10</td>\n", " <td>12</td>\n", " <td>11</td>\n", " <td>0</td>\n", " <td>13</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>Great Expectations</th>\n", " <td>60</td>\n", " <td>4</td>\n", " <td>20</td>\n", " <td>28</td>\n", " <td>60</td>\n", " <td>8</td>\n", " <td>49</td>\n", " <td>170</td>\n", " <td>16</td>\n", " <td>27</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " love hate murder terror cried food dead \\\n", "name \n", "Pride and Prejudice 92 9 0 0 91 0 5 \n", "Frankenstein 59 9 21 10 15 27 23 \n", "Dr. Jekyll and Mr. Hyde 3 1 10 12 11 0 13 \n", "Great Expectations 60 4 20 28 60 8 49 \n", "\n", " sister husband wife \n", "name \n", "Pride and Prejudice 217 50 47 \n", "Frankenstein 26 2 11 \n", "Dr. Jekyll and Mr. 
Hyde 0 0 1 \n", "Great Expectations 170 16 27 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts[['love', 'hate', 'murder', 'terror', 'cried', 'food', 'dead', 'sister', 'husband', 'wife']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although Python's **Counter** might be easier in situations where we're just looking at one piece of text and have time to clean it up, if you're looking to do more heavy lifting (including machine learning!) you'll want to turn to scikit-learn's vectorizers.\n", "\n", "While we talked at length about CountVectorizer here, TfidfVectorizer is another common one that takes into account how often a word shows up across all of your texts, and whether those texts are book-long or tweet-short." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review\n", "\n", "We covered how to count words in documents with scikit-learn's **CountVectorizer**. It works best with multiple documents at once and is a lot more complicated than working with Python's Counter. \n", "\n", "We'll forgive CountVectorizer for its complexity because it's the foundation of a lot of machine learning and text analysis that we'll cover later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }