Scraping app store reviews
For their project, the Washington Post found a "secret API" that let them download all the App Store reviews of their target "random chat apps." We're going to download reviews using the marketing platform Sensor Tower instead. Our target apps will be Chat with Strangers, Yubo, Holla, and Skout.
Sensor Tower's reviews section doesn't have a download button, so we use a Selenium web scraper to collect the information instead.
from bs4 import BeautifulSoup
import pandas as pd
import time
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get('https://sensortower.com/ios/US/twelve-app/app/yubo-make-new-friends/1038653883/review-history?selected_tab=reviews')
Select your options and scrape
After you log in, select the following options so that you only scrape US-based reviews. This is mostly to keep everything in English, as we won't be able to manually identify racism and similar content in non-English reviews.
- Date: All time
- Country: US
def get_page():
    # Parse the page as currently rendered and build one dict per review row
    doc = BeautifulSoup(driver.page_source, 'html.parser')
    rows = doc.select("tbody tr")
    datapoints = []
    for row in rows:
        cells = row.select("td")
        data = {
            'Country': cells[0].text.strip(),
            'Date': cells[1].text.strip(),
            # The star rating is only exposed as a CSS width, e.g. "width: 99%;"
            'Rating': cells[2].select_one('.gold')['style'],
            'Review': cells[3].select_one('.break-wrap-review').text.strip(),
            'Version': cells[4].text.strip()
        }
        datapoints.append(data)
    return datapoints
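To see what get_page() extracts without opening a browser, the same selectors can be run against a hand-written HTML fragment. This is only a minimal sketch: the sample markup is hypothetical, written to match the selectors used above (tbody tr, .gold, .break-wrap-review), and parse_rows is an illustrative name. Sensor Tower's real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, shaped only to match the selectors used in get_page()
SAMPLE_HTML = """
<table><tbody>
  <tr>
    <td> US </td>
    <td> 2019-06-01 </td>
    <td><div class="gold" style="width: 99%;"></div></td>
    <td><div class="break-wrap-review"> Great app </div></td>
    <td> 4.2.1 </td>
  </tr>
  <tr><td>incomplete row</td></tr>
</tbody></table>
"""


def parse_rows(html):
    """Return one dict per table row, like get_page() but taking HTML as input."""
    doc = BeautifulSoup(html, "html.parser")
    datapoints = []
    for row in doc.select("tbody tr"):
        cells = row.select("td")
        if len(cells) < 5:
            # Skip rows that don't have the five expected columns
            continue
        datapoints.append({
            "Country": cells[0].text.strip(),
            "Date": cells[1].text.strip(),
            "Rating": cells[2].select_one(".gold")["style"],
            "Review": cells[3].select_one(".break-wrap-review").text.strip(),
            "Version": cells[4].text.strip(),
        })
    return datapoints


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

The length check is an addition that get_page() does not have. It keeps a malformed row from raising an IndexError partway through a long scrape.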
all_data = []
wait = WebDriverWait(driver, 5, poll_frequency=0.05)
while True:
    # Wait for the loading overlay to disappear before reading the table
    wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, '.ajax-loading-cover')))
    results = get_page()
    all_data.extend(results)
    # The second pagination button is the "next" button
    next_button = driver.find_elements(By.CSS_SELECTOR, ".btn-group .pagination")[1]
    if next_button.get_attribute('disabled'):
        break
    next_button.click()
    # Pause so the loading overlay has time to appear before the next wait
    time.sleep(0.5)
    # Doesn't trigger fast enough!
    # wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.ajax-loading-cover')))
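The control flow of the loop (read the page, stop if the next button is disabled, otherwise click and repeat) can be checked offline by swapping the browser for a stand-in. In this sketch, FakePager and collect_all_pages are hypothetical names invented for illustration. Nothing here talks to Selenium or Sensor Tower.

```python
class FakePager:
    """Stand-in for the browser: holds a list of pages and a current position."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def read_page(self):
        # Plays the role of get_page()
        return self.pages[self.index]

    def next_disabled(self):
        # Plays the role of next_button.get_attribute('disabled')
        return self.index >= len(self.pages) - 1

    def click_next(self):
        # Plays the role of next_button.click()
        self.index += 1


def collect_all_pages(pager):
    """Same shape as the scraping loop: read, check for the last page, advance."""
    all_rows = []
    while True:
        all_rows.extend(pager.read_page())
        if pager.next_disabled():
            break
        pager.click_next()
    return all_rows


pager = FakePager([["review 1", "review 2"], ["review 3"], ["review 4"]])
collected = collect_all_pages(pager)
print(collected)
```

The last page is read before the loop breaks, which is why the disabled check comes after the page is collected and not before.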
df = pd.DataFrame(all_data)
df
# You'll change this filename for each app you're storing reviews for
df.to_csv("data/chat-for-strangers.csv", index=False)
Combine and add columns
Once we've saved reviews for each of the apps, we can combine them into a single file and add a column noting which app each review came from.
holla = pd.read_csv('data/holla.csv')
holla['source'] = 'holla'
yubo = pd.read_csv('data/yubo.csv')
yubo['source'] = 'yubo'
skout = pd.read_csv('data/skout.csv')
skout['source'] = 'skout'
strangers = pd.read_csv('data/chat-for-strangers.csv')
strangers['source'] = 'chat-for-strangers'
df = pd.concat([holla, yubo, skout, strangers], ignore_index=True)
df.shape
df.source.value_counts()
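With more apps, repeating the read-and-tag steps gets tedious. One possible alternative is a loop over the app names, sketched below. To keep it runnable without the scraped files, it reads from small in-memory CSV strings. The review rows are made up, and combine_reviews is a hypothetical helper name. With real data you would pass file paths such as data/holla.csv instead.

```python
import io

import pandas as pd


def combine_reviews(sources):
    """sources maps an app label to a path or file-like object for pd.read_csv."""
    frames = []
    for label, csv_source in sources.items():
        frame = pd.read_csv(csv_source)
        frame["source"] = label  # note which app each review came from
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Made-up stand-ins for the per-app CSV files
header = "Country,Date,Rating,Review,Version\n"
fake_files = {
    "holla": io.StringIO(header + "US,2019-01-01,width: 99%;,Fun,1.0\n"),
    "yubo": io.StringIO(
        header
        + "US,2019-01-02,width: 19%;,Bad,2.0\n"
        + "US,2019-01-03,width: 59%;,Okay,2.0\n"
    ),
}

combined = combine_reviews(fake_files)
print(combined.shape)
print(combined.source.value_counts())
```

As in the original cell, ignore_index=True gives the combined table a fresh 0..n-1 index, so row labels from the separate files don't repeat.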
We'll also add columns for racism, bullying, and unwanted sexual behavior. While we don't know which reviews contain this content yet, we'll use these columns to mark it in Excel or Google Sheets later.
# Using a machine learning algorithm to identify App Store reviews
# containing reports of unwanted sexual content, racism and bullying...
df['racism'] = np.nan
df['bullying'] = np.nan
df['sexual'] = np.nan
df.head()
Clean up the rating
Our ratings aren't numeric yet! The scraper stored each one as the style attribute of the star element (for example, "width: 99%;"), so let's convert those HTML star percentages to actual numbers.
df.Rating.value_counts()
df.Rating = df.Rating.replace({
    'width: 99%;': 5,
    'width: 79%;': 4,
    'width: 59%;': 3,
    'width: 39%;': 2,
    'width: 19%;': 1
})
df.head()
df.Rating.value_counts()
df.to_csv("data/reviews.csv", index=False)
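The lookup table above only covers the five widths it lists. If other widths ever appeared (an assumption, since the source shows only these five), the percentage could be parsed and rounded instead. A minimal sketch, where style_to_stars is a hypothetical helper and the one-star-per-20% rule is inferred from the mapping above:

```python
import re


def style_to_stars(style):
    """Convert a style string like 'width: 79%;' to a 1-5 star rating, or None."""
    match = re.search(r"width:\s*(\d+(?:\.\d+)?)%", str(style))
    if match is None:
        return None
    percent = float(match.group(1))
    # Each star is roughly 20% of the bar, so 99% -> 5, 79% -> 4, and so on
    return int(round(percent / 20))


styles = ["width: 99%;", "width: 79%;", "width: 59%;", "width: 39%;", "width: 19%;"]
stars = [style_to_stars(style) for style in styles]
print(stars)
```

On the DataFrame this could be applied with df.Rating.apply(style_to_stars). Values that don't match the pattern come back as None, which makes unexpected rows easy to spot.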
Review
Instead of asking Apple or finding a secret API like the Washington Post, we used an app marketing site to find App Store reviews of the apps we were interested in. They didn't have a download button, though, so we wrote a simple scraper to pull them down.
After obtaining the reviews, we combined them into one spreadsheet, added columns for racism, bullying, and unwanted sexual behavior that we'll fill in manually later, and converted the star ratings to numbers.
Discussion topics
Is pulling data from a secondary source okay?
How do we know that the site we scraped lists all of the available reviews?
Do we need all of the reviews, or could we have filtered them at this point to narrow our field down?