{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Counting words with Python's Counter\n", "\n", "Like all things, **counting words using Python** can be done two different ways: the easy way or the hard way. Using the [Counter](https://pymotw.com/3/collections/counter.html) tool is the easy way!\n", "\n", "Counter is generally used for, well, counting things.\n", "\n", "## Getting started" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({1: 3, 4: 2, 3: 4, 2: 3})" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "\n", "Counter([1, 4, 3, 2, 3, 3, 2, 1, 3, 4, 1, 2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have a list of words, you can use it to count how many times each word appears." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'hello': 3, 'goodbye': 2, 'party': 1})" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(['hello', 'goodbye', 'goodbye', 'hello', 'hello', 'party'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to use it to count words in a normal piece of text, though, we'll have to **turn our text into a list of words.** We also need to do a little bit of cleanup - removing punctuation, making everything lowercase, just making sure the only things left are words." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cleaned sentence is: yesterday i went fishing i dont fish that often so i didnt catch any fish i was told id enjoy myself but it didnt really seem that fun\n" ] }, { "data": { "text/plain": [ "Counter({'yesterday': 1,\n", " 'i': 4,\n", " 'went': 1,\n", " 'fishing': 1,\n", " 'dont': 1,\n", " 'fish': 2,\n", " 'that': 2,\n", " 'often': 1,\n", " 'so': 1,\n", " 'didnt': 2,\n", " 'catch': 1,\n", " 'any': 1,\n", " 'was': 1,\n", " 'told': 1,\n", " 'id': 1,\n", " 'enjoy': 1,\n", " 'myself': 1,\n", " 'but': 1,\n", " 'it': 1,\n", " 'really': 1,\n", " 'seem': 1,\n", " 'fun': 1})" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "text = \"\"\"Yesterday I went fishing. I don't fish that often, \n", "so I didn't catch any fish. 
I was told I'd enjoy myself, \n", "but it didn't really seem that fun.\"\"\"\n", "\n", "# Lowercase everything so FISH, fish and Fish all count as the same word\n", "text = text.lower()\n", "\n", "# Remove anything that isn't a word character or a space\n", "# We could use .replace(\".\", \"\") but regex is a lot easier!\n", "# (a raw string keeps Python from treating \\w as a string escape)\n", "text = re.sub(r\"[^\\w ]\", \"\", text)\n", "\n", "print(\"Cleaned sentence is:\", text)\n", "\n", "words = text.split(\" \")\n", "Counter(words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have a lot of text, you're usually only interested in the most common words. If you just want the top words, `.most_common` is going to be your best friend." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('i', 4), ('fish', 2), ('that', 2), ('didnt', 2), ('yesterday', 1)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(words).most_common(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Counting words in a book\n", "\n", "Now that we know the basics of how to clean text and do text analysis with `Counter`, let's try it with an actual book! We'll use Jane Austen's [Pride and Prejudice](http://www.gutenberg.org/cache/epub/42671/pg42671.txt)." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "d to be any thing extraordinary now. When a woman has\r\n", "five grown up daughters, she ought to give over thinking of her own\r\n", "beauty.\"\r\n", "\r\n", "\"In such cases, a woman has not often much beauty to think of.\"\r\n", "\r\n", "\"But, my dear, you must indeed go and see Mr. Bingley when he comes into\r\n", "the neighbourhood.\"\r\n", "\r\n", "\"It is more than I engage for, I assure you.\"\r\n", "\r\n", "\"But consider your daughters. 
Only think what an es\n" ] } ], "source": [ "import requests\n", "\n", "response = requests.get('http://www.gutenberg.org/cache/epub/42671/pg42671.txt')\n", "text = response.text\n", "\n", "print(text[4100:4500])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The easiest and most boring thing we can do is count the words in it. So, let's count the words in it." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the', 3751),\n", " ('to', 3746),\n", " ('of', 3298),\n", " ('', 3289),\n", " ('and', 3113),\n", " ('her', 1811),\n", " ('a', 1745),\n", " ('in', 1679),\n", " ('i', 1655),\n", " ('was', 1622),\n", " ('she', 1385),\n", " ('that', 1325),\n", " ('it', 1294),\n", " ('not', 1278),\n", " ('he', 1148),\n", " ('you', 1145),\n", " ('be', 1101),\n", " ('his', 1061),\n", " ('as', 1052),\n", " ('had', 1036)]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = text.lower()\n", "text = re.sub(r\"[^\\w ]\", \"\", text)\n", "\n", "words = text.split(\" \")\n", "Counter(words).most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See the empty string `''` in fourth place? Using `.split(\" \")` produces an empty \"word\" wherever two spaces end up side by side. Our regex also strips out line breaks, so the last word on one line gets glued to the first word on the next, which is likely where oddities like `camedown` come from later on. It's good enough for a quick look, but keep it in mind when reading the counts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Secret tricks with Counter\n", "\n", "Counting words is all fine and good, but if you have a little bit of regular expression skill, we can dig a little bit deeper!\n", "\n", "### Only extracting some words with regular expressions\n", "\n", "Do men and women do different things in this book? 
Let's look at `she ____` and `he ____` to see what we can find out!\n", "\n", "> `\\b` marks a **word boundary**. Without it, the phrase \"she talks\" would match both `she (\\w+)` and `he (\\w+)`, because \"he\" is hiding inside \"she\"." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['for', 'ought', 'is', 'was', 'was']" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Catch the word right after every 'she'\n", "# (the text is already lowercase, so [Ss] is just a safety net)\n", "she_words = re.findall(r\"\\b[Ss]he (\\w+)\", text)\n", "she_words[:5]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['is', 'had', 'camedown', 'agreed', 'isto']" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Catch the word right after every 'he'\n", "he_words = re.findall(r\"\\b[Hh]e (\\w+)\", text)\n", "he_words[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Most common verbs\n", "\n", "Then we can use `.most_common` to get the top verbs for both men and women. The words we caught aren't _necessarily_ verbs, but most of them should be." 
] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('had', 139),\n", " ('was', 129),\n", " ('is', 54),\n", " ('has', 42),\n", " ('could', 32),\n", " ('would', 24),\n", " ('did', 24),\n", " ('should', 23),\n", " ('will', 21),\n", " ('must', 20),\n", " ('might', 17),\n", " ('replied', 14),\n", " ('said', 12),\n", " ('thought', 11),\n", " ('does', 10),\n", " ('may', 10),\n", " ('looked', 9),\n", " ('never', 9),\n", " ('came', 9),\n", " ('continued', 8)]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Most common words after 'he'\n", "Counter(he_words).most_common(20)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[('was', 165),\n", " ('had', 152),\n", " ('could', 102),\n", " ('is', 46),\n", " ('would', 44),\n", " ('did', 26),\n", " ('felt', 26),\n", " ('might', 21),\n", " ('has', 19),\n", " ('will', 16),\n", " ('saw', 16),\n", " ('added', 15),\n", " ('should', 14),\n", " ('said', 13),\n", " ('i', 11),\n", " ('must', 10),\n", " ('found', 10),\n", " ('cried', 10),\n", " ('spoke', 10),\n", " ('began', 9)]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Most common words after 'she'\n", "Counter(she_words).most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tada! It's a very, very naive example of text analysis, but at least it's a start." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparing top words\n", "\n", "Now that we have two datasets created with `Counter`, we can actually push them into a pandas dataframe and do a comparison.\n", "\n", "We'll get the raw counts into the `he` and `she` columns, and then do a little bit of calculating to get a percentage column." 
] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>he</th>\n", " <th>she</th>\n", " <th>total</th>\n", " <th>pct_she</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>is</th>\n", " <td>54.0</td>\n", " <td>46.0</td>\n", " <td>100.0</td>\n", " <td>46.000000</td>\n", " </tr>\n", " <tr>\n", " <th>had</th>\n", " <td>139.0</td>\n", " <td>152.0</td>\n", " <td>291.0</td>\n", " <td>52.233677</td>\n", " </tr>\n", " <tr>\n", " <th>camedown</th>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>agreed</th>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>isto</th>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.000000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " he she total pct_she\n", "is 54.0 46.0 100.0 46.000000\n", "had 139.0 152.0 291.0 52.233677\n", "camedown 1.0 0.0 1.0 0.000000\n", "agreed 1.0 0.0 1.0 0.000000\n", "isto 1.0 0.0 1.0 0.000000" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame({\n", " 'he': Counter(he_words),\n", " 'she': Counter(she_words) \n", "}).fillna(0)\n", "\n", "df['total'] = df.he + df.she\n", "df['pct_she'] = df.she / df.total * 100\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at **words used ten or more times**, sorted by how often they're done by women." 
] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>he</th>\n", " <th>she</th>\n", " <th>total</th>\n", " <th>pct_she</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>cried</th>\n", " <td>0.0</td>\n", " <td>10.0</td>\n", " <td>10.0</td>\n", " <td>100.000000</td>\n", " </tr>\n", " <tr>\n", " <th>felt</th>\n", " <td>3.0</td>\n", " <td>26.0</td>\n", " <td>29.0</td>\n", " <td>89.655172</td>\n", " </tr>\n", " <tr>\n", " <th>saw</th>\n", " <td>2.0</td>\n", " <td>16.0</td>\n", " <td>18.0</td>\n", " <td>88.888889</td>\n", " </tr>\n", " <tr>\n", " <th>i</th>\n", " <td>3.0</td>\n", " <td>11.0</td>\n", " <td>14.0</td>\n", " <td>78.571429</td>\n", " </tr>\n", " <tr>\n", " <th>could</th>\n", " <td>32.0</td>\n", " <td>102.0</td>\n", " <td>134.0</td>\n", " <td>76.119403</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " he she total pct_she\n", "cried 0.0 10.0 10.0 100.000000\n", "felt 3.0 26.0 29.0 89.655172\n", "saw 2.0 16.0 18.0 88.888889\n", "i 3.0 11.0 14.0 78.571429\n", "could 32.0 102.0 134.0 76.119403" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.total >= 10].sort_values(by='pct_she', ascending=False).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again: super naive text analysis, with a totally cherry-picked example to make \"cried\" and \"felt\" show up at the top. Feels like we did something cool, though, right? 
You can find [other books at Project Gutenberg](http://www.gutenberg.org/ebooks/) if you're interested in doing more.\n", "\n", "## Review\n", "\n", "We used Python's `Counter` tool to easily count words in a document or two. It also works well with pandas dataframes, allowing us to make simple comparisons." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }