Text analysis snippets

Python data science coding reference from investigate.ai

Reading in files

Reading in one file

Reading in one file, nice and easy

with open("filename.txt") as f:
    content = f.read()

Reading in multiple files

This will give you a dataframe with two columns - one with the filename, the other with the contents of the file.

It also uses glob to pattern-match - this will read in all filenames that end in .txt in the current folder.

import glob
import pandas as pd

filenames = glob.glob("*.txt")
contents = [open(filename).read() for filename in filenames]
df = pd.DataFrame({
  'filename': filenames,
  'content': contents
})
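
If the files are not in your system's default encoding, or you want the rows in a predictable order, a variation like the one below may help. It is only a sketch: the helper name read_text_files, the utf-8 default, and the throwaway demo files are assumptions for illustration, not part of the original snippet.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_text_files(folder, pattern="*.txt", encoding="utf-8"):
    """Read every file matching `pattern` in `folder` into a dataframe.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    # Sort the paths so the row order is the same on every run
    paths = sorted(Path(folder).glob(pattern))
    rows = []
    for path in paths:
        # read_text opens the file, reads it, and closes it again
        rows.append({
            "filename": path.name,
            "content": path.read_text(encoding=encoding),
        })
    return pd.DataFrame(rows, columns=["filename", "content"])


# Demo on a temporary folder holding two .txt files and one other file
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.txt").write_text("first file", encoding="utf-8")
    Path(folder, "b.txt").write_text("second file", encoding="utf-8")
    Path(folder, "notes.md").write_text("not matched", encoding="utf-8")
    demo_df = read_text_files(folder)

print(demo_df)
```

Passing the encoding explicitly avoids surprises when the same code runs on a machine with a different default.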

Topic modeling

NMF with sklearn

from sklearn.decomposition import NMF

model = NMF(n_components=5)
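
The model above still needs a document-term matrix to fit on. Here is a minimal end-to-end sketch, assuming TF-IDF features and a tiny made-up corpus. The docs list, the choice of two topics, and the random_state and max_iter values are all illustrative assumptions.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny invented corpus, purely for illustration
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "my dog chased the cat around the garden",
    "the stock market fell sharply on monday",
    "investors sold stock as the market dropped",
    "the bank raised interest rates and the market reacted",
]

# Turn the raw text into a (documents x terms) TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)

# Fit NMF and get one row of topic weights per document
model = NMF(n_components=2, random_state=0, max_iter=500)
doc_topics = model.fit_transform(matrix)

print(doc_topics.shape)         # (number of documents, number of topics)
print(model.components_.shape)  # (number of topics, number of terms)
```

The rows of model.components_ are the per-topic term weights, which is what a top-terms loop reads from.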

LDA with sklearn

from sklearn.decomposition import LatentDirichletAllocation

model = LatentDirichletAllocation(n_components=5)
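
LDA is usually fit on raw word counts rather than TF-IDF weights, so a sketch of the full pipeline might pair it with CountVectorizer. The corpus, the two-topic setting, and random_state below are assumptions made only for the example.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny invented corpus, purely for illustration
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "my dog chased the cat around the garden",
    "the stock market fell sharply on monday",
    "investors sold stock as the market dropped",
    "the bank raised interest rates and the market reacted",
]

# LDA models word counts, so build a plain count matrix
vectorizer = CountVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)

model = LatentDirichletAllocation(n_components=2, random_state=0)
# Each row is one document's mix of topics
doc_topics = model.fit_transform(matrix)

print(np.round(doc_topics, 2))
# Most likely topic for each document
print(doc_topics.argmax(axis=1))
```

Because each row is a topic distribution for one document, the values in a row add up to 1.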

LSA/LSI with sklearn

from sklearn.decomposition import TruncatedSVD

model = TruncatedSVD(n_components=5)

Find best options for LDA

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# Options to try with our LDA
# Beware it will try *all* of the combinations, so it'll take ages
search_params = {
  'n_components': [5, 7, 10, 15],
  'learning_decay': [.5, .7, .9]
}

# Set up LDA with the options we'll keep static
model = LatentDirichletAllocation(learning_method='online')

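# Try all of the options
# (matrix is assumed to be your vectorized documents, e.g. from CountVectorizer)
gridsearch = GridSearchCV(model, param_grid=search_params, cv=5, n_jobs=-1, verbose=1)
gridsearch.fit(matrix)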

# What did we find?
print("Best Model's Params: ", gridsearch.best_params_)
print("Best Log Likelihood Score: ", gridsearch.best_score_)

Topic terms for topic models

n_words = 5
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(model.components_):
    message = "Topic #%d: " % topic_idx
    message += " ".join([feature_names[i]
                         for i in topic.argsort()[:-n_words - 1:-1]])
    print(message)
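
The slice after argsort() is the least obvious part of this snippet, so here it is in isolation. argsort() returns indices from smallest to largest weight, and the negative-step slice walks back from the end to pick the n_words largest. The helper name top_term_indices and the toy weights are assumptions for illustration.

```python
import numpy as np


def top_term_indices(topic_weights, n_words):
    """Indices of the n_words largest weights, largest first (hypothetical helper)."""
    # argsort() is ascending, so step backwards from the end of the array
    return topic_weights.argsort()[:-n_words - 1:-1]


# Toy vocabulary and one made-up row of topic weights
feature_names = np.array(["apple", "bank", "cat", "dog", "egg"])
topic = np.array([0.1, 0.9, 0.3, 0.7, 0.05])

top = top_term_indices(topic, 3)
print(top)                           # [1 3 2]
print(" ".join(feature_names[top]))  # bank dog cat
```

With distinct weights, np.argsort(-topic_weights)[:n_words] gives the same indices and may read more clearly.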



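Clustering

KMeans with sklearn

from sklearn.cluster import KMeans

# matrix is assumed to be your vectorized documents (e.g. from TfidfVectorizer)
km = KMeans(n_clusters=5)
km.fit(matrix)
df['prediction'] = km.predict(matrix)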
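
KMeans has to be fit on a feature matrix before it can assign clusters. A minimal sketch of the whole flow follows, assuming TF-IDF features and a small invented dataframe. The example texts, the two-cluster setting, and the n_init and random_state values are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Small invented dataframe, purely for illustration
df = pd.DataFrame({
    "content": [
        "the cat chased the mouse",
        "a cat and a mouse in the house",
        "the market and stock prices fell",
        "stock prices rose as the market rallied",
    ]
})

# Vectorize the text column into a (documents x terms) matrix
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(df["content"])

# fit_predict fits the model and returns one cluster label per row
km = KMeans(n_clusters=2, n_init=10, random_state=0)
df["prediction"] = km.fit_predict(matrix)

print(df)
```

Cluster numbers are arbitrary labels, so 0 and 1 can swap between runs unless random_state is fixed.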

Top cluster terms

print("Top terms per cluster:")
# Sort each centroid's term weights from largest to smallest
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(km.n_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))