{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Using topic modeling to extract topics from documents\n",
                "\n",
                "Sometimes you have a nice big set of documents, and all you wish for is to know what's hiding inside. But without reading them, of course! Two approaches to try to lazily get some information from your texts are **topic modeling** and **clustering**."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Prep work: Downloading necessary files\n",
                "Before we get started, we need to download all of the data we'll be using.\n",
                "* **recipes.csv:** recipes - a list of recipes (but only with ingredient names)\n",
                "* **state-of-the-union.csv:** State of the Union addresses - each presidential address from 1970 to 2012\n"
            ]
        },
        {
            "cell_type": "code",
            "metadata": {},
            "source": [
                "# Make data directory if it doesn't exist\n",
                "!mkdir -p data\n",
                "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/text-analysis/data/recipes.csv -P data\n",
                "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/text-analysis/data/state-of-the-union.csv -P data"
            ],
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## How computers read\n",
                "\n",
                "I'm going to tell you a big secret: **computers are really really really bad at reading documents and figuring out what they're about.** Text is for _people_ to read, people with a collective knowledge of The World At Large and a history of reading things and all kinds of other tricky secret little things we don't think about that help us understand what a piece of text means.\n",
                "\n",
                "When dealing with understanding content, computers are good for _very specific situations_ to do _very specific things_. Or alternatively, to do a not-that-great job when you aren't going to be terribly picky about the results.\n",
                "\n",
                "Do I sound a little biased? Oh, but aren't we all. It isn't going to stop us from talking about it, though!\n",
                "\n",
                "Before we start, **let's make some assumptions:**\n",
                "\n",
                "* When you're dealing with documents, each document is (typically) about something.\n",
                "* You know what each document is about by looking at the words in the document.\n",
                "* Documents with similar words are probably about similar things.\n",
                "\n",
                "We have two major options available to us: **topic modeling** and **clustering**. There's a lot of NLP nuance going on between the two, but we're going to keep it simple:\n",
                "\n",
                "**Topic modeling** is for when each document can be about **multiple topics**. There might be 100 different topics, and a document might be 30% about one topic, 20% about another, and then 50% spread out between the others.\n",
                "\n",
                "**Clustering** is for when each document should only fit into **one topic**. It's an all-or-nothing approach.\n",
                "\n",
                "The most important part of _all of this_ is the fact that **the computer figures out these topics by itself**. You don't tell it what to do! If you're teaching the algorithm what different specific topics look like, that's **classification.** In this case we're just saying \"hey computer, please figure this out!\"\n",
                "\n",
                "Let's get started."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 316,
            "metadata": {},
            "outputs": [],
            "source": [
                "import pandas as pd\n",
                "import matplotlib.pyplot as plt\n",
                "\n",
                "# These styles look nicer than default pandas\n",
                "plt.style.use('ggplot')\n",
                "\n",
                "# We'll be able to see more text at once\n",
                "pd.set_option(\"display.max_colwidth\", 100)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Attempt one: Recipes\n",
                "\n",
                "### Our dataset\n",
                "\n",
                "We're going to start with analyzing **about 36,000 recipes**. Food is interesting because you can split it so many ways: by courses, or by baked goods vs meat vs vegetables vs others, by national cuisine..."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 317,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>cuisine</th>\n",
                            "      <th>id</th>\n",
                            "      <th>ingredient_list</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>greek</td>\n",
                            "      <td>10259</td>\n",
                            "      <td>romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>southern_us</td>\n",
                            "      <td>25693</td>\n",
                            "      <td>plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, ye...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>filipino</td>\n",
                            "      <td>20130</td>\n",
                            "      <td>eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powde...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>indian</td>\n",
                            "      <td>22213</td>\n",
                            "      <td>water, vegetable oil, wheat, salt</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>indian</td>\n",
                            "      <td>13162</td>\n",
                            "      <td>black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lem...</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "       cuisine     id  \\\n",
                            "0        greek  10259   \n",
                            "1  southern_us  25693   \n",
                            "2     filipino  20130   \n",
                            "3       indian  22213   \n",
                            "4       indian  13162   \n",
                            "\n",
                            "                                                                                       ingredient_list  \n",
                            "0  romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo...  \n",
                            "1  plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, ye...  \n",
                            "2  eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powde...  \n",
                            "3                                                                    water, vegetable oil, wheat, salt  \n",
                            "4  black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lem...  "
                        ]
                    },
                    "execution_count": 317,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "recipes = pd.read_csv(\"data/recipes.csv\")\n",
                "recipes.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "In order to analyze the text, we'll need to count the words in each recipe. To do that we're going to use a **stemmed TF-IDF vectorizer** from scikit-learn.\n",
                "\n",
                "* **Stemming** will allow us to combine words like `tomato` and `tomatoes`\n",
                "* Using **TF-IDF** will allow us to devalue common ingredients like salt and water\n",
                "\n",
                "I'm using the code from [the reference section](https://investigate.ai/reference/vectorizing/#stem-and-vectorize), just adjusted from a `CountVectorizer` to a `TfidfVectorizer`, and set it so ingredients have to appear in at least **fifty recipes**."
            ]
        },
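        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "To see why TF-IDF pushes down common ingredients, here is a minimal sketch that works out the \"IDF\" half by hand. The three recipes and the `toy_` names are made up for illustration. The formula is the smoothed inverse document frequency that scikit-learn's `TfidfVectorizer` uses by default, `ln((1 + n) / (1 + df)) + 1`. The real vectorizer also multiplies by how often a word appears in each recipe and normalizes each row, which this sketch skips.\n",
                "\n",
                "Salt is in every toy recipe, so it should get the lowest weight. Saffron is in only one, so it should get the highest."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import math\n",
                "from collections import Counter\n",
                "\n",
                "# Three made-up recipes, just for illustration\n",
                "toy_docs = [\n",
                "    'salt water flour',\n",
                "    'salt tomato garlic',\n",
                "    'salt water saffron',\n",
                "]\n",
                "toy_tokens = [doc.split() for doc in toy_docs]\n",
                "n_docs = len(toy_tokens)\n",
                "\n",
                "# Document frequency: how many recipes contain each word\n",
                "toy_doc_freq = Counter(word for words in toy_tokens for word in set(words))\n",
                "\n",
                "# Smoothed IDF: the more recipes a word appears in, the smaller its weight\n",
                "toy_idf = {word: math.log((1 + n_docs) / (1 + df)) + 1\n",
                "           for word, df in toy_doc_freq.items()}\n",
                "\n",
                "# Print from least to most distinctive\n",
                "for word in sorted(toy_idf, key=toy_idf.get):\n",
                "    print(f'{word:8s} in {toy_doc_freq[word]} of {n_docs} recipes, idf = {toy_idf[word]:.2f}')"
            ]
        },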
        {
            "cell_type": "code",
            "execution_count": 318,
            "metadata": {},
            "outputs": [],
            "source": [
                "from sklearn.feature_extraction.text import TfidfVectorizer\n",
                "import Stemmer\n",
                "\n",
                "# English stemmer from pyStemmer\n",
                "stemmer = Stemmer.Stemmer('en')\n",
                "\n",
                "analyzer = TfidfVectorizer().build_analyzer()\n",
                "\n",
                "# Override TfidfVectorizer\n",
                "class StemmedTfidfVectorizer(TfidfVectorizer):\n",
                "    def build_analyzer(self):\n",
                "        analyzer = super(TfidfVectorizer, self).build_analyzer()\n",
                "        return lambda doc: stemmer.stemWords(analyzer(doc))"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 319,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>activ</th>\n",
                            "      <th>adobo</th>\n",
                            "      <th>agav</th>\n",
                            "      <th>alfredo</th>\n",
                            "      <th>all</th>\n",
                            "      <th>allspic</th>\n",
                            "      <th>almond</th>\n",
                            "      <th>amchur</th>\n",
                            "      <th>anaheim</th>\n",
                            "      <th>ancho</th>\n",
                            "      <th>...</th>\n",
                            "      <th>wrapper</th>\n",
                            "      <th>yam</th>\n",
                            "      <th>yeast</th>\n",
                            "      <th>yellow</th>\n",
                            "      <th>yoghurt</th>\n",
                            "      <th>yogurt</th>\n",
                            "      <th>yolk</th>\n",
                            "      <th>yukon</th>\n",
                            "      <th>zest</th>\n",
                            "      <th>zucchini</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.278745</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.276000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.210575</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "<p>5 rows \u00d7 752 columns</p>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "   activ  adobo  agav  alfredo  all  allspic  almond  amchur  anaheim  ancho  \\\n",
                            "0    0.0    0.0   0.0      0.0  0.0      0.0     0.0     0.0      0.0    0.0   \n",
                            "1    0.0    0.0   0.0      0.0  0.0      0.0     0.0     0.0      0.0    0.0   \n",
                            "2    0.0    0.0   0.0      0.0  0.0      0.0     0.0     0.0      0.0    0.0   \n",
                            "3    0.0    0.0   0.0      0.0  0.0      0.0     0.0     0.0      0.0    0.0   \n",
                            "4    0.0    0.0   0.0      0.0  0.0      0.0     0.0     0.0      0.0    0.0   \n",
                            "\n",
                            "   ...  wrapper  yam  yeast    yellow  yoghurt    yogurt  yolk  yukon  zest  \\\n",
                            "0  ...      0.0  0.0    0.0  0.000000      0.0  0.000000   0.0    0.0   0.0   \n",
                            "1  ...      0.0  0.0    0.0  0.278745      0.0  0.000000   0.0    0.0   0.0   \n",
                            "2  ...      0.0  0.0    0.0  0.276000      0.0  0.000000   0.0    0.0   0.0   \n",
                            "3  ...      0.0  0.0    0.0  0.000000      0.0  0.000000   0.0    0.0   0.0   \n",
                            "4  ...      0.0  0.0    0.0  0.000000      0.0  0.210575   0.0    0.0   0.0   \n",
                            "\n",
                            "   zucchini  \n",
                            "0       0.0  \n",
                            "1       0.0  \n",
                            "2       0.0  \n",
                            "3       0.0  \n",
                            "4       0.0  \n",
                            "\n",
                            "[5 rows x 752 columns]"
                        ]
                    },
                    "execution_count": 319,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "vectorizer = StemmedTfidfVectorizer(min_df=50)\n",
                "matrix = vectorizer.fit_transform(recipes.ingredient_list)\n",
                "\n",
                "# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()\n",
                "words_df = pd.DataFrame(matrix.toarray(),\n",
                "                        columns=vectorizer.get_feature_names_out())\n",
                "words_df.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Looks like we have 752 terms! These are stemmed words rather than whole ingredients, and a few of them, like `all` (presumably from \"all-purpose flour\"), aren't much use on their own, but let's stick with it for now."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Topic modeling\n",
                "\n",
                "There are multiple techniques for topic modeling, but in the end they do the same thing: **you get a list of topics, and a list of words associated with each topic.**\n",
                "\n",
                "We'll use scikit-learn's `NMF` (non-negative matrix factorization) and tell it to break the recipes down into **five topics.**"
            ]
        },
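        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "If it helps to picture what comes out, here's a rough sketch on a tiny made-up matrix. NMF splits a recipe-by-word matrix into two smaller ones: how much of each topic is in each recipe, and how much each word matters to each topic. The words, the counts, and the `toy_` names are invented for illustration, and `init`, `random_state` and `max_iter` are only set so the toy result is repeatable."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import numpy as np\n",
                "from sklearn.decomposition import NMF\n",
                "\n",
                "# Made-up counts: 4 recipes (rows) x 4 words (columns)\n",
                "toy_words = ['flour', 'sugar', 'soy', 'ginger']\n",
                "toy_matrix = np.array([\n",
                "    [2, 1, 0, 0],\n",
                "    [1, 2, 0, 0],\n",
                "    [0, 0, 2, 1],\n",
                "    [0, 0, 1, 2],\n",
                "], dtype=float)\n",
                "\n",
                "toy_model = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=500)\n",
                "\n",
                "# One row per recipe, one column per topic\n",
                "toy_doc_topics = toy_model.fit_transform(toy_matrix)\n",
                "# One row per topic, one column per word\n",
                "toy_topic_words = toy_model.components_\n",
                "\n",
                "print(toy_doc_topics.shape)   # (4, 2)\n",
                "print(toy_topic_words.shape)  # (2, 4)"
            ]
        },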
        {
            "cell_type": "code",
            "execution_count": 320,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,\n",
                            "    n_components=5, random_state=None, shuffle=False, solver='cd', tol=0.0001,\n",
                            "    verbose=0)"
                        ]
                    },
                    "execution_count": 320,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "from sklearn.decomposition import NMF\n",
                "\n",
                "model = NMF(n_components=5)\n",
                "model.fit(matrix)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Why five topics? **Because we have to tell it _something_.** Our job is to decide the number of topics, and it's the computer's job to find the topics. We'll talk about how to pick the \"right\" number later, but for now: it's magic.\n",
                "\n",
                "Fitting the model allowed it to \"learn\" what the ingredients are and how they're organized. Now we just need to find out what's inside. Let's ask for the **top ten terms in each topic.**"
            ]
        },
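        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Pulling the top terms out of a topic comes down to sorting that topic's row of weights. Here is a small sketch of the trick with made-up weights (the `toy_` names and numbers are invented for illustration): `argsort()` gives the positions from smallest weight to largest, so we reverse it and keep the first few."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import numpy as np\n",
                "\n",
                "# Made-up terms and the weight each one has in a single pretend topic\n",
                "toy_names = np.array(['butter', 'egg', 'flour', 'salt', 'sugar'])\n",
                "toy_topic = np.array([0.4, 0.9, 1.5, 0.1, 1.2])\n",
                "\n",
                "# Positions of the weights, ordered from smallest weight to largest\n",
                "toy_order = toy_topic.argsort()\n",
                "\n",
                "# Reverse so the largest comes first, then keep the top three\n",
                "toy_top_three = toy_order[::-1][:3]\n",
                "\n",
                "print(toy_names[toy_top_three])"
            ]
        },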
        {
            "cell_type": "code",
            "execution_count": 321,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Topic 0: oliv pepper fresh oil dri garlic salt parsley red tomato\n",
                        "Topic 1: flour egg sugar purpos all butter bake milk larg powder\n",
                        "Topic 2: sauc soy sesam rice oil ginger sugar chicken vinegar garlic\n",
                        "Topic 3: ground chili cilantro cumin powder lime onion pepper chop fresh\n",
                        "Topic 4: chees shred cream parmesan cheddar grate tortilla mozzarella sour chicken\n"
                    ]
                }
            ],
            "source": [
                "n_words = 10\n",
                "feature_names = vectorizer.get_feature_names_out()\n",
                "\n",
                "topic_list = []\n",
                "for topic_idx, topic in enumerate(model.components_):\n",
                "    # Sort terms by weight, largest first, and keep the top n_words\n",
                "    top_n = [feature_names[i] for i in topic.argsort()[::-1][:n_words]]\n",
                "    top_features = ' '.join(top_n)\n",
                "    # Name each topic after its three strongest terms\n",
                "    topic_list.append(f\"topic_{'_'.join(top_n[:3])}\")\n",
                "\n",
                "    print(f\"Topic {topic_idx}: {top_features}\")"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 322,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "['topic_oliv_pepper_fresh', 'topic_flour_egg_sugar', 'topic_sauc_soy_sesam', 'topic_ground_chili_cilantro', 'topic_chees_shred_cream']\n"
                    ]
                }
            ],
            "source": [
                "print(topic_list)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Those actually seem like _pretty good topics_. Italian-ish, then baking, then Chinese, maybe Latin American or Indian food, and then dairy. What if we did it with **fifteen topics** instead?"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 323,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Topic 0: pepper bell red green onion celeri flake black tomato crush\n",
                        "Topic 1: flour purpos all bake powder butter soda buttermilk salt egg\n",
                        "Topic 2: sauc soy sesam oil ginger rice sugar garlic scallion starch\n",
                        "Topic 3: tortilla cream shred chees sour cheddar salsa corn bean jack\n",
                        "Topic 4: chees parmesan grate mozzarella pasta ricotta basil italian fresh spinach\n",
                        "Topic 5: lime cilantro fresh chop juic jalapeno chile avocado chili fish\n",
                        "Topic 6: chicken breast boneless skinless broth halv sodium low fat thigh\n",
                        "Topic 7: ground black pepper cumin cinnamon salt beef cayenn kosher paprika\n",
                        "Topic 8: chili seed powder cumin coriand masala garam curri ginger coconut\n",
                        "Topic 9: sugar egg vanilla milk extract larg cream butter yolk unsalt\n",
                        "Topic 10: oliv extra virgin oil clove garlic fresh salt tomato parsley\n",
                        "Topic 11: white wine vinegar rice shallot red salt grain mustard sugar\n",
                        "Topic 12: dri oregano tomato thyme parsley garlic bay basil leaf onion\n",
                        "Topic 13: lemon juic fresh orang zest parsley grate mint peel yogurt\n",
                        "Topic 14: water yeast warm sugar salt cold flour activ boil ice\n"
                    ]
                }
            ],
            "source": [
                "model = NMF(n_components=15)\n",
                "model.fit(matrix)\n",
                "\n",
                "n_words = 10\n",
                "feature_names = vectorizer.get_feature_names_out()\n",
                "\n",
                "topic_list = []\n",
                "for topic_idx, topic in enumerate(model.components_):\n",
                "    # Sort terms by weight, largest first, and keep the top n_words\n",
                "    top_n = [feature_names[i] for i in topic.argsort()[::-1][:n_words]]\n",
                "    top_features = ' '.join(top_n)\n",
                "    # Name each topic after its three strongest terms\n",
                "    topic_list.append(f\"topic_{'_'.join(top_n[:3])}\")\n",
                "\n",
                "    print(f\"Topic {topic_idx}: {top_features}\")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "This is where we start to see **the big difference between categories and topics**. With five topics, the groups looked very much like cuisines - Italian, Chinese, etc. But now that we're breaking it down further, the groups have changed a bit.\n",
                "\n",
                "They're now **more like classes of ingredients.** Baking gets a category - `flour purpos all bake powder` - and so do generic Mediterranean ingredients - `oliv extra virgin oil clove garlic fresh salt`. The algorithm got a little confused about black pepper vs. hot pepper flakes vs. red/green bell peppers when it created `pepper bell red green onion celeri flake black`, but we understand what it's going for.\n",
                "\n",
                "Remember, the important thing about topic modeling is that every row in our dataset is a **combination of topics**. It might be a little bit about one thing, a little bit less about another, etc etc. Let's take a look at how that works."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 324,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>topic_pepper_bell_red</th>\n",
                            "      <th>topic_flour_purpos_all</th>\n",
                            "      <th>topic_sauc_soy_sesam</th>\n",
                            "      <th>topic_tortilla_cream_shred</th>\n",
                            "      <th>topic_chees_parmesan_grate</th>\n",
                            "      <th>topic_lime_cilantro_fresh</th>\n",
                            "      <th>topic_chicken_breast_boneless</th>\n",
                            "      <th>topic_ground_black_pepper</th>\n",
                            "      <th>topic_chili_seed_powder</th>\n",
                            "      <th>topic_sugar_egg_vanilla</th>\n",
                            "      <th>topic_oliv_extra_virgin</th>\n",
                            "      <th>topic_white_wine_vinegar</th>\n",
                            "      <th>topic_dri_oregano_tomato</th>\n",
                            "      <th>topic_lemon_juic_fresh</th>\n",
                            "      <th>topic_water_yeast_warm</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>1.438751</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>2.671147</td>\n",
                            "      <td>1.613176</td>\n",
                            "      <td>0.2249</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.507867</td>\n",
                            "      <td>0.0000</td>\n",
                            "      <td>0.00000</td>\n",
                            "      <td>2.561140</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.472827</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>2.447803</td>\n",
                            "      <td>3.044097</td>\n",
                            "      <td>0.176603</td>\n",
                            "      <td>1.411024</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>5.717447</td>\n",
                            "      <td>0.4958</td>\n",
                            "      <td>1.59881</td>\n",
                            "      <td>0.548627</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>1.592754</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.2495</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "   topic_pepper_bell_red  topic_flour_purpos_all  topic_sauc_soy_sesam  \\\n",
                            "0               1.438751                0.000000              0.000000   \n",
                            "1               2.447803                3.044097              0.176603   \n",
                            "\n",
                            "   topic_tortilla_cream_shred  topic_chees_parmesan_grate  \\\n",
                            "0                    2.671147                    1.613176   \n",
                            "1                    1.411024                    0.000000   \n",
                            "\n",
                            "   topic_lime_cilantro_fresh  topic_chicken_breast_boneless  \\\n",
                            "0                     0.2249                            0.0   \n",
                            "1                     0.0000                            0.0   \n",
                            "\n",
                            "   topic_ground_black_pepper  topic_chili_seed_powder  \\\n",
                            "0                   0.507867                   0.0000   \n",
                            "1                   5.717447                   0.4958   \n",
                            "\n",
                            "   topic_sugar_egg_vanilla  topic_oliv_extra_virgin  topic_white_wine_vinegar  \\\n",
                            "0                  0.00000                 2.561140                       0.0   \n",
                            "1                  1.59881                 0.548627                       0.0   \n",
                            "\n",
                            "   topic_dri_oregano_tomato  topic_lemon_juic_fresh  topic_water_yeast_warm  \n",
                            "0                  0.472827                     0.0                  0.0000  \n",
                            "1                  1.592754                     0.0                  0.2495  "
                        ]
                    },
                    "execution_count": 324,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "# If we don't want 'real' names for the topics, we can run this line\n",
                "# topic_list = [f\"topic_{i}\" for i in range(model.n_components_)]\n",
                "\n",
                "# Score every recipe against every topic (multiplied by 100 so the numbers are easier to read)\n",
                "amounts = model.transform(matrix) * 100\n",
                "\n",
                "# Set it up as a dataframe\n",
                "topics = pd.DataFrame(amounts, columns=topic_list)\n",
                "topics.head(2)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Our first recipe is primarily topic 3 (`topic_tortilla_cream_shred`) with a score of 2.67, closely followed by topic 10 (`topic_oliv_extra_virgin`) at 2.56. It's also a bit topic 4 and topic 0, with scores of 1.61 and 1.44.\n",
                "\n",
                "Our second recipe is a bit bolder - it scores a whopping 5.7 in topic 7 (`topic_ground_black_pepper`), with topics 1 and 0 coming up at around 3.0 and 2.4.\n",
                "\n",
                "Let's combine this topics dataframe with our **original dataframe** so we can see it all in one place."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 325,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>cuisine</th>\n",
                            "      <th>id</th>\n",
                            "      <th>ingredient_list</th>\n",
                            "      <th>topic_pepper_bell_red</th>\n",
                            "      <th>topic_flour_purpos_all</th>\n",
                            "      <th>topic_sauc_soy_sesam</th>\n",
                            "      <th>topic_tortilla_cream_shred</th>\n",
                            "      <th>topic_chees_parmesan_grate</th>\n",
                            "      <th>topic_lime_cilantro_fresh</th>\n",
                            "      <th>topic_chicken_breast_boneless</th>\n",
                            "      <th>topic_ground_black_pepper</th>\n",
                            "      <th>topic_chili_seed_powder</th>\n",
                            "      <th>topic_sugar_egg_vanilla</th>\n",
                            "      <th>topic_oliv_extra_virgin</th>\n",
                            "      <th>topic_white_wine_vinegar</th>\n",
                            "      <th>topic_dri_oregano_tomato</th>\n",
                            "      <th>topic_lemon_juic_fresh</th>\n",
                            "      <th>topic_water_yeast_warm</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>greek</td>\n",
                            "      <td>10259</td>\n",
                            "      <td>romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo...</td>\n",
                            "      <td>1.438751</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>2.671147</td>\n",
                            "      <td>1.613176</td>\n",
                            "      <td>0.2249</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.507867</td>\n",
                            "      <td>0.0000</td>\n",
                            "      <td>0.00000</td>\n",
                            "      <td>2.561140</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.472827</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>southern_us</td>\n",
                            "      <td>25693</td>\n",
                            "      <td>plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, ye...</td>\n",
                            "      <td>2.447803</td>\n",
                            "      <td>3.044097</td>\n",
                            "      <td>0.176603</td>\n",
                            "      <td>1.411024</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>5.717447</td>\n",
                            "      <td>0.4958</td>\n",
                            "      <td>1.59881</td>\n",
                            "      <td>0.548627</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>1.592754</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.2495</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "       cuisine     id  \\\n",
                            "0        greek  10259   \n",
                            "1  southern_us  25693   \n",
                            "\n",
                            "                                                                                       ingredient_list  \\\n",
                            "0  romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo...   \n",
                            "1  plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, ye...   \n",
                            "\n",
                            "   topic_pepper_bell_red  topic_flour_purpos_all  topic_sauc_soy_sesam  \\\n",
                            "0               1.438751                0.000000              0.000000   \n",
                            "1               2.447803                3.044097              0.176603   \n",
                            "\n",
                            "   topic_tortilla_cream_shred  topic_chees_parmesan_grate  \\\n",
                            "0                    2.671147                    1.613176   \n",
                            "1                    1.411024                    0.000000   \n",
                            "\n",
                            "   topic_lime_cilantro_fresh  topic_chicken_breast_boneless  \\\n",
                            "0                     0.2249                            0.0   \n",
                            "1                     0.0000                            0.0   \n",
                            "\n",
                            "   topic_ground_black_pepper  topic_chili_seed_powder  \\\n",
                            "0                   0.507867                   0.0000   \n",
                            "1                   5.717447                   0.4958   \n",
                            "\n",
                            "   topic_sugar_egg_vanilla  topic_oliv_extra_virgin  topic_white_wine_vinegar  \\\n",
                            "0                  0.00000                 2.561140                       0.0   \n",
                            "1                  1.59881                 0.548627                       0.0   \n",
                            "\n",
                            "   topic_dri_oregano_tomato  topic_lemon_juic_fresh  topic_water_yeast_warm  \n",
                            "0                  0.472827                     0.0                  0.0000  \n",
                            "1                  1.592754                     0.0                  0.2495  "
                        ]
                    },
                    "execution_count": 325,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "merged = recipes.merge(topics, right_index=True, left_index=True)\n",
                "merged.head(2)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Now we can do things like...\n",
                "\n",
                "* Uncover possible topics discussed in the dataset\n",
                "* See how many documents cover each topic\n",
                "* Find the top documents in each topic\n",
                "\n",
                "And **graph it!** Let's see what our distribution of topics looks like."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 340,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "<matplotlib.legend.Legend at 0x16c0f15f8>"
                        ]
                    },
                    "execution_count": 340,
                    "metadata": {},
                    "output_type": "execute_result"
                },
                {
                    "data": {
                        "image/png": "\n",
                        "text/plain": [
                            "<Figure size 432x288 with 1 Axes>"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                }
            ],
            "source": [
                "ax = merged[topic_list].sum().to_frame().T.plot(kind='barh', stacked=True)\n",
                "\n",
                "# Move the legend off of the chart\n",
                "ax.legend(loc=(1.04,0))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Suspiciously even, but that's an investigation for another day. Let's try a different dataset that splits a little differently."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Attempt two: State of the Union addresses\n",
                "\n",
                "One of the fun things to do with topic modeling is see how **things change over time.** For this example, we're going to reproduce an assignment from [Jonathan Stray's Computational Journalism course](http://www.compjournalism.com/?p=208).\n",
                "\n",
                "At the beginning of each year, the President of the United States traditionally addresses Congress in a speech called the State of the Union. It's a good way to judge what's important in the country at the time, because the speech is sure to be used as a platform to address the legislative agenda for the year. Let's see if topic modeling can help illustrate how it's **changed over time.**"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Our data\n",
                "\n",
                "We have a simple CSV of State of the Union addresses, nothing too crazy."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 357,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>year</th>\n",
                            "      <th>content</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>60</th>\n",
                            "      <td>1849</td>\n",
                            "      <td>\\nState of the Union Address\\nZachary Taylor\\nDecember 4, 1849\\n\\nFellow-Citizens of the Senate ...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>94</th>\n",
                            "      <td>1883</td>\n",
                            "      <td>\\nState of the Union Address\\nChester A. Arthur\\nDecember 4, 1883\\n\\nTo the Congress of the Unit...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>63</th>\n",
                            "      <td>1852</td>\n",
                            "      <td>\\nState of the Union Address\\nMillard Fillmore\\nDecember 6, 1852\\n\\nFellow-Citizens of the Senat...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>148</th>\n",
                            "      <td>1938</td>\n",
                            "      <td>\\nState of the Union Address\\nFranklin D. Roosevelt\\nJanuary 3, 1938\\n\\nMr. President, Mr. Speak...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>156</th>\n",
                            "      <td>1946</td>\n",
                            "      <td>\\nState of the Union Address\\nHarry S. Truman\\nJanuary 21, 1946\\n\\nTo the Congress of the United...</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "     year  \\\n",
                            "60   1849   \n",
                            "94   1883   \n",
                            "63   1852   \n",
                            "148  1938   \n",
                            "156  1946   \n",
                            "\n",
                            "                                                                                                 content  \n",
                            "60   \\nState of the Union Address\\nZachary Taylor\\nDecember 4, 1849\\n\\nFellow-Citizens of the Senate ...  \n",
                            "94   \\nState of the Union Address\\nChester A. Arthur\\nDecember 4, 1883\\n\\nTo the Congress of the Unit...  \n",
                            "63   \\nState of the Union Address\\nMillard Fillmore\\nDecember 6, 1852\\n\\nFellow-Citizens of the Senat...  \n",
                            "148  \\nState of the Union Address\\nFranklin D. Roosevelt\\nJanuary 3, 1938\\n\\nMr. President, Mr. Speak...  \n",
                            "156  \\nState of the Union Address\\nHarry S. Truman\\nJanuary 21, 1946\\n\\nTo the Congress of the United...  "
                        ]
                    },
                    "execution_count": 357,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "speeches = pd.read_csv(\"data/state-of-the-union.csv\")\n",
                "speeches.sample(5)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "It's not too many, only 226. Because it's a smaller dataset, we're able to do more computationally intensive forms of topic modeling (LDA, for example) without sitting around getting bored."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 358,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "(226, 2)"
                        ]
                    },
                    "execution_count": 358,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "speeches.shape"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "To help the analysis out a bit, we're going to clean the text. Only a little bit, though - we'll just replace anything that isn't a letter or a space (numbers, punctuation, line breaks) with a space."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 359,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>year</th>\n",
                            "      <th>content</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>1790</td>\n",
                            "      <td>George WashingtonJanuary        Fellow Citizens of the Senate and House of Representatives I emb...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>1790</td>\n",
                            "      <td>State of the Union AddressGeorge WashingtonDecember        Fellow Citizens of the Senate and Hou...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>1791</td>\n",
                            "      <td>State of the Union AddressGeorge WashingtonOctober         Fellow Citizens of the Senate and Hou...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>1792</td>\n",
                            "      <td>State of the Union AddressGeorge WashingtonNovember        Fellow Citizens of the Senate and Hou...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>1793</td>\n",
                            "      <td>State of the Union AddressGeorge WashingtonDecember        Fellow Citizens of the Senate and Hou...</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "   year  \\\n",
                            "0  1790   \n",
                            "1  1790   \n",
                            "2  1791   \n",
                            "3  1792   \n",
                            "4  1793   \n",
                            "\n",
                            "                                                                                               content  \n",
                            "0  George WashingtonJanuary        Fellow Citizens of the Senate and House of Representatives I emb...  \n",
                            "1  State of the Union AddressGeorge WashingtonDecember        Fellow Citizens of the Senate and Hou...  \n",
                            "2  State of the Union AddressGeorge WashingtonOctober         Fellow Citizens of the Senate and Hou...  \n",
                            "3  State of the Union AddressGeorge WashingtonNovember        Fellow Citizens of the Senate and Hou...  \n",
                            "4  State of the Union AddressGeorge WashingtonDecember        Fellow Citizens of the Senate and Hou...  "
                        ]
                    },
                    "execution_count": 359,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "# Replace anything that isn't a letter or a space (digits, punctuation, newlines) with a space\n",
                "speeches.content = speeches.content.str.replace(\"[^A-Za-z ]\", \" \", regex=True)\n",
                "speeches.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Vectorize\n",
                "\n",
                "We're going to use the same TF-IDF vectorizer we used up above, which stems words in addition to vectorizing them. We'll reproduce the code down here for completeness's sake (and easy cut-and-paste)."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 360,
            "metadata": {},
            "outputs": [],
            "source": [
                "from sklearn.feature_extraction.text import TfidfVectorizer\n",
                "import Stemmer\n",
                "\n",
                "# English stemmer from pyStemmer\n",
                "stemmer = Stemmer.Stemmer('en')\n",
                "\n",
                "# Subclass TfidfVectorizer so every token is stemmed after the default analysis\n",
                "class StemmedTfidfVectorizer(TfidfVectorizer):\n",
                "    def build_analyzer(self):\n",
                "        analyzer = super().build_analyzer()\n",
                "        return lambda doc: stemmer.stemWords(analyzer(doc))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "With our first pass we'll **vectorize everything**, no limits!"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 361,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>aaa</th>\n",
                            "      <th>aana</th>\n",
                            "      <th>aaron</th>\n",
                            "      <th>abail</th>\n",
                            "      <th>abal</th>\n",
                            "      <th>abalanc</th>\n",
                            "      <th>abandon</th>\n",
                            "      <th>abandonedbi</th>\n",
                            "      <th>abandonednow</th>\n",
                            "      <th>abandonedtheir</th>\n",
                            "      <th>...</th>\n",
                            "      <th>zimbabwean</th>\n",
                            "      <th>zinc</th>\n",
                            "      <th>zionchurch</th>\n",
                            "      <th>zollverein</th>\n",
                            "      <th>zone</th>\n",
                            "      <th>zoneof</th>\n",
                            "      <th>zonesin</th>\n",
                            "      <th>zoolog</th>\n",
                            "      <th>zoom</th>\n",
                            "      <th>zuloaga</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "<p>5 rows \u00d7 73222 columns</p>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "   aaa  aana  aaron  abail  abal  abalanc  abandon  abandonedbi  abandonednow  \\\n",
                            "0  0.0   0.0    0.0    0.0   0.0      0.0      0.0          0.0           0.0   \n",
                            "1  0.0   0.0    0.0    0.0   0.0      0.0      0.0          0.0           0.0   \n",
                            "2  0.0   0.0    0.0    0.0   0.0      0.0      0.0          0.0           0.0   \n",
                            "3  0.0   0.0    0.0    0.0   0.0      0.0      0.0          0.0           0.0   \n",
                            "4  0.0   0.0    0.0    0.0   0.0      0.0      0.0          0.0           0.0   \n",
                            "\n",
                            "   abandonedtheir  ...  zimbabwean  zinc  zionchurch  zollverein  zone  \\\n",
                            "0             0.0  ...         0.0   0.0         0.0         0.0   0.0   \n",
                            "1             0.0  ...         0.0   0.0         0.0         0.0   0.0   \n",
                            "2             0.0  ...         0.0   0.0         0.0         0.0   0.0   \n",
                            "3             0.0  ...         0.0   0.0         0.0         0.0   0.0   \n",
                            "4             0.0  ...         0.0   0.0         0.0         0.0   0.0   \n",
                            "\n",
                            "   zoneof  zonesin  zoolog  zoom  zuloaga  \n",
                            "0     0.0      0.0     0.0   0.0      0.0  \n",
                            "1     0.0      0.0     0.0   0.0      0.0  \n",
                            "2     0.0      0.0     0.0   0.0      0.0  \n",
                            "3     0.0      0.0     0.0   0.0      0.0  \n",
                            "4     0.0      0.0     0.0   0.0      0.0  \n",
                            "\n",
                            "[5 rows x 73222 columns]"
                        ]
                    },
                    "execution_count": 361,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "vectorizer = StemmedTfidfVectorizer(stop_words='english')\n",
                "matrix = vectorizer.fit_transform(speeches.content)\n",
                "\n",
                "words_df = pd.DataFrame(matrix.toarray(),\n",
                "                        columns=vectorizer.get_feature_names_out())\n",
                "words_df.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Running NMF topic modeling\n",
                "\n",
                "Now we'll leap into topic modeling. We'll look at fifteen topics, since we're covering a long span of time where lots of different things may have happened."
            ]
        },
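        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Before running it on the speeches, it may help to see the shape of what NMF hands back. NMF approximates a documents-by-words matrix as the product of two smaller non-negative matrices: one that is documents-by-topics and one that is topics-by-words. The sketch below uses random made-up numbers rather than real text, and the names `toy`, `toy_model`, `doc_topics` and `topic_words` are invented for this example."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import numpy as np\n",
                "from sklearn.decomposition import NMF\n",
                "\n",
                "# Made-up non-negative matrix: 6 'documents' by 4 'words'\n",
                "rng = np.random.default_rng(0)\n",
                "toy = rng.random((6, 4))\n",
                "\n",
                "# Ask for 2 topics; random_state only makes the sketch repeatable\n",
                "toy_model = NMF(n_components=2, init='nndsvda', max_iter=500, random_state=0)\n",
                "\n",
                "doc_topics = toy_model.fit_transform(toy)  # documents x topics\n",
                "topic_words = toy_model.components_        # topics x words\n",
                "\n",
                "print(doc_topics.shape, topic_words.shape)"
            ]
        },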
        {
            "cell_type": "code",
            "execution_count": 362,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Topic 0: state govern mexico unit congress constitut territori power countri act\n",
                        "Topic 1: program year nation billion new feder congress help american goal\n",
                        "Topic 2: america tonight year ve american job work let peopl help\n",
                        "Topic 3: war enemi fight victori nation japanes hitler american forc men\n",
                        "Topic 4: law nation govern state work corpor men great man peopl\n",
                        "Topic 5: state unit nation public govern congress year great commerc million\n",
                        "Topic 6: terrorist iraq america iraqi terror tonight american al help regim\n",
                        "Topic 7: soviet world nation free defens communist peac econom militari freedom\n",
                        "Topic 8: govern state bank public subject countri duti peopl treasuri general\n",
                        "Topic 9: year state govern unit report law congress cent silver increas\n",
                        "Topic 10: govern state unit treati countri congress american relat convent spain\n",
                        "Topic 11: nation world govern peopl democraci today problem state peac congress\n",
                        "Topic 12: gentlemen state unit nation public law measur indian object repres\n",
                        "Topic 13: shall matter men govern peopl countri thought great make necessari\n",
                        "Topic 14: govern agricultur year countri feder nation public industri econom bank\n"
                    ]
                }
            ],
            "source": [
                "from sklearn.decomposition import NMF\n",
                "\n",
                "# Factor the TF-IDF matrix into 15 topics\n",
                "model = NMF(n_components=15)\n",
                "model.fit(matrix)\n",
                "\n",
                "n_words = 10\n",
                "feature_names = vectorizer.get_feature_names_out()\n",
                "\n",
                "topic_list = []\n",
                "for topic_idx, topic in enumerate(model.components_):\n",
                "    # Indices of the n_words highest-weighted terms, largest first\n",
                "    top_idx = topic.argsort()[-n_words:][::-1]\n",
                "    top_n = [feature_names[i] for i in top_idx]\n",
                "    top_features = ' '.join(top_n)\n",
                "    # Name each topic after its three strongest terms\n",
                "    topic_list.append(f\"topic_{'_'.join(top_n[:3])}\")\n",
                "\n",
                "    print(f\"Topic {topic_idx}: {top_features}\")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Let's be honest with ourselves: **we expected something a bit better.** So many of these words are so _common_ that it doesn't do much to convince me these are meaningful concepts."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Adjusting our min and max document frequency\n",
                "\n",
                "One way to cut those overly common words from our topic model is to **remove them in the vectorizer.** Instead of accepting _all_ words, we can set minimum and maximum limits on how many documents a word appears in.\n",
                "\n",
                "Let's only accept words that are used **in at least 5 speeches**, but that also **don't appear in more than half of the speeches**."
            ]
        },
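        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "As a rough illustration of how `min_df` and `max_df` prune the vocabulary, here is a minimal sketch on a tiny made-up corpus. The sentences and the names `toy_docs`, `unfiltered` and `filtered` are invented for this example and are not part of the speeches data. An integer like `min_df=2` is read as a number of documents, while a float like `max_df=0.5` is read as a share of all documents."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "from sklearn.feature_extraction.text import TfidfVectorizer\n",
                "\n",
                "# Made-up corpus, for illustration only\n",
                "toy_docs = [\n",
                "    'the union is strong',\n",
                "    'the union needs jobs',\n",
                "    'the war is over',\n",
                "    'the treaty ends the war',\n",
                "]\n",
                "\n",
                "# No limits: every word is kept\n",
                "unfiltered = TfidfVectorizer().fit(toy_docs)\n",
                "print(sorted(unfiltered.vocabulary_))\n",
                "\n",
                "# Keep words found in at least 2 documents, but in no more than half of them.\n",
                "# 'the' is in all four documents, so max_df=0.5 drops it;\n",
                "# words used only once, like 'treaty', are dropped by min_df=2.\n",
                "filtered = TfidfVectorizer(min_df=2, max_df=0.5).fit(toy_docs)\n",
                "print(sorted(filtered.vocabulary_))"
            ]
        },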
        {
            "cell_type": "code",
            "execution_count": 363,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>abal</th>\n",
                            "      <th>abalanc</th>\n",
                            "      <th>abandon</th>\n",
                            "      <th>abat</th>\n",
                            "      <th>abdic</th>\n",
                            "      <th>abett</th>\n",
                            "      <th>abey</th>\n",
                            "      <th>abhorr</th>\n",
                            "      <th>abid</th>\n",
                            "      <th>abil</th>\n",
                            "      <th>...</th>\n",
                            "      <th>yourfavor</th>\n",
                            "      <th>youth</th>\n",
                            "      <th>youto</th>\n",
                            "      <th>youwil</th>\n",
                            "      <th>yukon</th>\n",
                            "      <th>zeal</th>\n",
                            "      <th>zealand</th>\n",
                            "      <th>zealous</th>\n",
                            "      <th>zero</th>\n",
                            "      <th>zone</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.00000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.082383</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.00000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.00000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.061166</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.071243</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.04508</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.062934</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>...</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.00000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "<p>5 rows \u00d7 8272 columns</p>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "   abal  abalanc  abandon      abat  abdic  abett  abey  abhorr  abid  abil  \\\n",
                            "0   0.0      0.0      0.0  0.000000    0.0    0.0   0.0     0.0   0.0   0.0   \n",
                            "1   0.0      0.0      0.0  0.000000    0.0    0.0   0.0     0.0   0.0   0.0   \n",
                            "2   0.0      0.0      0.0  0.000000    0.0    0.0   0.0     0.0   0.0   0.0   \n",
                            "3   0.0      0.0      0.0  0.071243    0.0    0.0   0.0     0.0   0.0   0.0   \n",
                            "4   0.0      0.0      0.0  0.000000    0.0    0.0   0.0     0.0   0.0   0.0   \n",
                            "\n",
                            "   ...  yourfavor  youth  youto    youwil  yukon     zeal  zealand   zealous  \\\n",
                            "0  ...        0.0    0.0    0.0  0.000000    0.0  0.00000      0.0  0.000000   \n",
                            "1  ...        0.0    0.0    0.0  0.082383    0.0  0.00000      0.0  0.000000   \n",
                            "2  ...        0.0    0.0    0.0  0.000000    0.0  0.00000      0.0  0.061166   \n",
                            "3  ...        0.0    0.0    0.0  0.000000    0.0  0.04508      0.0  0.062934   \n",
                            "4  ...        0.0    0.0    0.0  0.000000    0.0  0.00000      0.0  0.000000   \n",
                            "\n",
                            "   zero  zone  \n",
                            "0   0.0   0.0  \n",
                            "1   0.0   0.0  \n",
                            "2   0.0   0.0  \n",
                            "3   0.0   0.0  \n",
                            "4   0.0   0.0  \n",
                            "\n",
                            "[5 rows x 8272 columns]"
                        ]
                    },
                    "execution_count": 363,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "vectorizer = StemmedTfidfVectorizer(stop_words='english', min_df=5, max_df=0.5)\n",
                "matrix = vectorizer.fit_transform(speeches.content)\n",
                "\n",
                "words_df = pd.DataFrame(matrix.toarray(),\n",
                "                        columns=vectorizer.get_feature_names_out())\n",
                "words_df.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "And now we'll **check the topic model.**"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 364,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Topic 0: island cuba arbitr june spain pension award confer consular commission\n",
                        "Topic 1: program billion today goal budget achiev area level farm percent\n",
                        "Topic 2: ve job tonight budget cut ll school spend don deficit\n",
                        "Topic 3: spain coloni franc articl intercours tribe minist port navig commenc\n",
                        "Topic 4: terrorist iraq iraqi terror tonight al regim afghanistan qaeda fight\n",
                        "Topic 5: fight enemi japanes today democraci victori tank plane task attack\n",
                        "Topic 6: mexico texa mexican oregon california annex minist articl steamer loan\n",
                        "Topic 7: method relief cent veteran board farmer farm tariff depress committe\n",
                        "Topic 8: silver gold currenc note circul coinag cent bond coin speci\n",
                        "Topic 9: soviet communist atom threat aggress ve missil korea weapon ii\n",
                        "Topic 10: militia british enemi council tribe whilst decre port regular neutral\n",
                        "Topic 11: gentlemen commission amiti satisfact articl burthen militia prospect majesti hostil\n",
                        "Topic 12: corpor interst forest island philippin railroad deal class supervis bodi\n",
                        "Topic 13: kansa slave slaveri june rebellion whilst commenc theconstitut minist south\n",
                        "Topic 14: vietnam tonight billion budget program tri percent goal crime poverti\n"
                    ]
                }
            ],
            "source": [
                "# Refit NMF with 15 topics on the filtered TF-IDF matrix\n",
                "model = NMF(n_components=15)\n",
                "model.fit(matrix)\n",
                "\n",
                "n_words = 10\n",
                "feature_names = vectorizer.get_feature_names_out()\n",
                "\n",
                "topic_list = []\n",
                "for topic_idx, topic in enumerate(model.components_):\n",
                "    # Indices of the n_words highest-weighted terms, largest first\n",
                "    top_idx = topic.argsort()[-n_words:][::-1]\n",
                "    top_n = [feature_names[i] for i in top_idx]\n",
                "    top_features = ' '.join(top_n)\n",
                "    # Name each topic after its three strongest terms\n",
                "    topic_list.append(f\"topic_{'_'.join(top_n[:3])}\")\n",
                "\n",
                "    print(f\"Topic {topic_idx}: {top_features}\")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "That's looking a little more interesting! Lots of references to wars and political conflict, along with slavery and monetary policy."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Visualizing the outcome\n",
                "\n",
                "We can get a better handle on what our data looks like through a little visualization. We'll start by loading up the **topic popularity dataframe**. Remember that each row is one of our speeches."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 365,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>topic_island_cuba_arbitr</th>\n",
                            "      <th>topic_program_billion_today</th>\n",
                            "      <th>topic_ve_job_tonight</th>\n",
                            "      <th>topic_spain_coloni_franc</th>\n",
                            "      <th>topic_terrorist_iraq_iraqi</th>\n",
                            "      <th>topic_fight_enemi_japanes</th>\n",
                            "      <th>topic_mexico_texa_mexican</th>\n",
                            "      <th>topic_method_relief_cent</th>\n",
                            "      <th>topic_silver_gold_currenc</th>\n",
                            "      <th>topic_soviet_communist_atom</th>\n",
                            "      <th>topic_militia_british_enemi</th>\n",
                            "      <th>topic_gentlemen_commission_amiti</th>\n",
                            "      <th>topic_corpor_interst_forest</th>\n",
                            "      <th>topic_kansa_slave_slaveri</th>\n",
                            "      <th>topic_vietnam_tonight_billion</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>2.404538</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.116872</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.761927</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>33.359101</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.138286</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.000000</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>13.178591</td>\n",
                            "      <td>25.444151</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>0.0</td>\n",
                            "      <td>1.364718</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "   topic_island_cuba_arbitr  topic_program_billion_today  \\\n",
                            "0                       0.0                          0.0   \n",
                            "1                       0.0                          0.0   \n",
                            "\n",
                            "   topic_ve_job_tonight  topic_spain_coloni_franc  topic_terrorist_iraq_iraqi  \\\n",
                            "0                   0.0                  2.404538                         0.0   \n",
                            "1                   0.0                  0.000000                         0.0   \n",
                            "\n",
                            "   topic_fight_enemi_japanes  topic_mexico_texa_mexican  \\\n",
                            "0                   0.116872                        0.0   \n",
                            "1                   0.000000                        0.0   \n",
                            "\n",
                            "   topic_method_relief_cent  topic_silver_gold_currenc  \\\n",
                            "0                       0.0                   0.761927   \n",
                            "1                       0.0                   0.000000   \n",
                            "\n",
                            "   topic_soviet_communist_atom  topic_militia_british_enemi  \\\n",
                            "0                          0.0                     0.000000   \n",
                            "1                          0.0                    13.178591   \n",
                            "\n",
                            "   topic_gentlemen_commission_amiti  topic_corpor_interst_forest  \\\n",
                            "0                         33.359101                          0.0   \n",
                            "1                         25.444151                          0.0   \n",
                            "\n",
                            "   topic_kansa_slave_slaveri  topic_vietnam_tonight_billion  \n",
                            "0                        0.0                       0.138286  \n",
                            "1                        0.0                       1.364718  "
                        ]
                    },
                    "execution_count": 365,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "# Convert our counts into numbers\n",
                "amounts = model.transform(matrix) * 100\n",
                "\n",
                "# Set it up as a dataframe\n",
                "topics = pd.DataFrame(amounts, columns=topic_list)\n",
                "topics.head(2)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The first row is our first speech, the second row is our second speech, and so on."
            ]
        },
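        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Because each row is one speech and each column is one topic, you can ask which topic dominates a speech, or what share of a speech each topic takes up. Here is a minimal sketch on a toy table. The numbers are copied from the first two rows above, but the `toy` dataframe and the choice of only three columns are assumptions made for illustration, so the shares are relative to those three topics only.\n",
                "\n",
                "```python\n",
                "import pandas as pd\n",
                "\n",
                "# Toy table: two speeches, three of the topic columns shown above\n",
                "toy = pd.DataFrame({\n",
                "    'topic_spain_coloni_franc': [2.404538, 0.000000],\n",
                "    'topic_militia_british_enemi': [0.000000, 13.178591],\n",
                "    'topic_gentlemen_commission_amiti': [33.359101, 25.444151],\n",
                "})\n",
                "\n",
                "# Name of the strongest topic in each speech\n",
                "top_topic = toy.idxmax(axis=1)\n",
                "\n",
                "# Share of each topic within a speech, so every row adds up to 1\n",
                "shares = toy.div(toy.sum(axis=1), axis=0)\n",
                "\n",
                "print(top_topic)\n",
                "print(shares.round(2))\n",
                "```\n",
                "\n",
                "The same two calls should work on the full `topics` dataframe, since it has the same layout."
            ]
        },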
        {
            "cell_type": "code",
            "execution_count": 366,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "<matplotlib.legend.Legend at 0x1b3f15668>"
                        ]
                    },
                    "execution_count": 366,
                    "metadata": {},
                    "output_type": "execute_result"
                },
                {
                    "data": {
                        "image/png": "\n",
                        "text/plain": [
                            "<Figure size 432x288 with 1 Axes>"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                }
            ],
            "source": [
                "ax = topics.sum().to_frame().T.plot(kind='barh', stacked=True)\n",
                "\n",
                "# Move the legend off of the chart\n",
                "ax.legend(loc=(1.04,0))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Again, pretty even! A few are larger or smaller, but overall the topics seem pretty evenly distributed.\n",
                "\n",
                "Looking at things over all time doesn't mean much, though, we're interseted in **change over time**.\n",
                "\n",
                "The hip way to do this is with a **streamgraph**, which is a stacked area graph that centers on the vertical axis. Usually you'd have to merge the two dataframes in order to graph, but we can sneakily get around it since we aren't plotting with pandas (plotting streamgraphs requires directly talking to matplotlib)."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 367,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "<matplotlib.legend.Legend at 0x1d088d6d8>"
                        ]
                    },
                    "execution_count": 367,
                    "metadata": {},
                    "output_type": "execute_result"
                },
                {
                    "data": {
                        "image/png": "\n",
                        "text/plain": [
                            "<Figure size 720x360 with 1 Axes>"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                }
            ],
            "source": [
                "x_axis = speeches.year\n",
                "y_axis = topics\n",
                "\n",
                "fig, ax = plt.subplots(figsize=(10,5))\n",
                "\n",
                "# Plot a stackplot - https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/stackplot_demo.html\n",
                "ax.stackplot(x_axis, y_axis.T, baseline='wiggle', labels=y_axis.columns)\n",
                "\n",
                "# Move the legend off of the chart\n",
                "ax.legend(loc=(1.04,0))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "I know that \"Presidents talk about current news topics\" is probably not the most exciting things you've ever seen, but you can watch things rise and fall easily enough."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 368,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "<matplotlib.legend.Legend at 0x1cf975c18>"
                        ]
                    },
                    "execution_count": 368,
                    "metadata": {},
                    "output_type": "execute_result"
                },
                {
                    "data": {
                        "image/png": "\n",
                        "text/plain": [
                            "<Figure size 720x216 with 1 Axes>"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                }
            ],
            "source": [
                "merged = topics.join(speeches)\n",
                "\n",
                "ax = merged.plot(x='year', y=['topic_kansa_slave_slaveri', 'topic_soviet_communist_atom'], figsize=(10,3))\n",
                "ax.legend(loc=(1.04,0))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## So what do you do with this?\n",
                "\n",
                "Good question. TODO.\n",
                "\n",
                "## Review\n",
                "\n",
                "In this section we looked at **topic modeling**, a technique of extracting topics out of text datasets. Unlike clustering, where each document is assigned one category, in topic modeling **each document is considered blend of different topics.**\n",
                "\n",
                "You don't need to \"teach\" a topic model anything about your dataset, you just let it loose and it comes back with what terms represent each topic. The only thing you need to give it is the **number of topics to find**.\n",
                "\n",
                "The way you preprocess the text is very important to a topic model. We found that common words ended up appearing in many topics unless we used `max_df=` in our vectorizer to filter out high-frequency words.\n",
                "\n",
                "There are many different algorithms to use for topic modeling, but we're saving that for a later section.\n",
                "\n",
                "## Discussion topics\n",
                "\n",
                "TODO"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": []
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.6.8"
        },
        "toc": {
            "base_numbering": 1,
            "nav_menu": {},
            "number_sections": true,
            "sideBar": true,
            "skip_h1_title": false,
            "title_cell": "Table of Contents",
            "title_sidebar": "Contents",
            "toc_cell": false,
            "toc_position": {},
            "toc_section_display": true,
            "toc_window_display": false
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}