Scraping app store reviews

For their project, the Washington Post found a "secret API" that let them download all of the App Store reviews for their target "random chat apps." We're going to download reviews from the marketing platform Sensor Tower instead. Our target apps will be Chat with Strangers, Yubo, Holla, and Skout.

Sensor Tower's reviews section doesn't have a download button, so we'll use a Selenium web scraper to collect the reviews instead.

from bs4 import BeautifulSoup
import pandas as pd
import time
import numpy as np

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Open Chrome and load the Sensor Tower review history page for one app (Yubo here)
driver = webdriver.Chrome()
driver.get('https://sensortower.com/ios/US/twelve-app/app/yubo-make-new-friends/1038653883/review-history?selected_tab=reviews')

Select your options and scrape

After you log in, select the following options so that you only scrape US-based reviews. This is mostly to keep everything in English, since we won't be able to manually spot racism, bullying, and similar content in non-English reviews.

  • Date: All time
  • Country: US
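
def get_page():
    """Parse the reviews table on the current page into a list of dicts."""
    # Name the parser explicitly so BeautifulSoup doesn't have to guess one
    doc = BeautifulSoup(driver.page_source, "html.parser")
    rows = doc.select("tbody tr")

    datapoints = []
    for row in rows:
        cells = row.select("td")
        data = {
            'Country': cells[0].text.strip(),
            'Date': cells[1].text.strip(),
            # The star rating is only exposed as a CSS width, e.g. "width: 79%;"
            'Rating': cells[2].select_one('.gold')['style'],
            'Review': cells[3].select_one('.break-wrap-review').text.strip(),
            'Version': cells[4].text.strip()
        }
        datapoints.append(data)
    return datapoints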
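
To check the parsing logic without a live browser session, the same selectors can be run against a static snippet. This is only a sketch: the HTML below is a hypothetical, simplified stand-in for one row of the reviews table (the real page markup may differ), and parse_rows is an illustrative name, not part of the original scraper.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single review row; the real
# Sensor Tower page may be structured differently.
SAMPLE_HTML = """
<table><tbody>
  <tr>
    <td> US </td>
    <td>11/03/2019</td>
    <td><div class="stars"><div class="gold" style="width: 99%;"></div></div></td>
    <td><div class="break-wrap-review"> So much fun </div></td>
    <td>4.3.9</td>
  </tr>
</tbody></table>
"""


def parse_rows(html):
    """Apply the same selectors as get_page() to an HTML string."""
    doc = BeautifulSoup(html, "html.parser")
    datapoints = []
    for row in doc.select("tbody tr"):
        cells = row.select("td")
        datapoints.append({
            "Country": cells[0].text.strip(),
            "Date": cells[1].text.strip(),
            "Rating": cells[2].select_one(".gold")["style"],
            "Review": cells[3].select_one(".break-wrap-review").text.strip(),
            "Version": cells[4].text.strip(),
        })
    return datapoints


rows = parse_rows(SAMPLE_HTML)
print(rows[0])
```

Testing against a saved snippet like this makes it easier to notice when a selector stops matching, since a missing element would raise an error on a single known row instead of partway through a long scrape.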

all_data = []
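# Wait up to 5 seconds for a condition, checking every 0.05 seconds
wait = WebDriverWait(driver, 5, poll_frequency=0.05)
while True:
    # Wait for the loading overlay to disappear before reading the table
    wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, '.ajax-loading-cover')))

    results = get_page()
    all_data.extend(results)

    # The second pagination button is "next"; it is disabled on the last page.
    # find_elements(By.CSS_SELECTOR, ...) is the Selenium 4 form of this lookup.
    next_button = driver.find_elements(By.CSS_SELECTOR, ".btn-group .pagination")[1]
    if next_button.get_attribute('disabled'):
        break
    next_button.click()
    # Give the loading overlay a moment to appear before the next wait
    time.sleep(0.5)
    # Doesn't trigger fast enough!
    # wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.ajax-loading-cover')))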
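
The wait.until(...) call in the loop repeatedly checks a condition until it holds or a timeout expires. The sketch below is a simplified, pure-Python stand-in for that polling idea, not Selenium's actual implementation. The names wait_until and fake_overlay_gone are hypothetical and exist only for illustration.

```python
import time


def wait_until(condition, timeout=5.0, poll_frequency=0.05):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds elapse first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll_frequency)


# Simulate a loading overlay that is gone on the third check.
checks = {"count": 0}


def fake_overlay_gone():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(fake_overlay_gone, timeout=1.0, poll_frequency=0.01)
print(checks["count"])
```

A polling wait only sees the page at each check. If the overlay has not appeared yet when the first check runs, a wait for it to be invisible succeeds immediately, which is presumably why the loop pauses with time.sleep(0.5) after each click.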

df = pd.DataFrame(all_data)
df
      Country  Date        Rating       Review                                             Version
0     US       11/19/2019  width: 19%;  This is an Omegle knockoff. Don’t recommend. 9...  -
1     US       11/03/2019  width: 99%;  So much fun                                        4.3.9
2     US       10/31/2019  width: 19%;  No woman                                           4.3.9
3     US       10/31/2019  width: 79%;  My camera is still not working                     4.3.9
4     US       10/25/2019  width: 19%;  Cam broke with new iOS update just green lines     4.3.8
...   ...      ...         ...          ...                                                ...
3179  US       07/19/2011  width: 99%;  Fun app glad I got it for free, would be aweso...  1.0
3180  US       07/19/2011  width: 39%;  Love this on iPad, but I'm trying to download ...  -
3181  US       07/18/2011  width: 59%;  Great but drops convo all tge time :(              -
3182  US       07/18/2011  width: 99%;  Works just like the service it connects to.        -
3183  US       07/16/2011  width: 19%;  This app is a waste of money. Connect randomly...  -

3184 rows × 5 columns

# You'll change this filename for each app you're storing reviews for
df.to_csv("data/chat-for-strangers.csv", index=False)

Combine and add columns

Once we've saved reviews for several different apps, we're ready to go. We'll combine them all into one single file and add a note about what app each review came from.

holla = pd.read_csv('data/holla.csv')
holla['source'] = 'holla'

yubo = pd.read_csv('data/yubo.csv')
yubo['source'] = 'yubo'

skout = pd.read_csv('data/skout.csv')
skout['source'] = 'skout'

strangers = pd.read_csv('data/chat-for-strangers.csv')
strangers['source'] = 'chat-for-strangers'
df = pd.concat([holla, yubo, skout, strangers], ignore_index=True)
df.shape
(56056, 6)
df.source.value_counts()
skout                 37484
holla                 10467
yubo                   4921
chat-for-strangers     3184
Name: source, dtype: int64
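
The read-and-label steps above repeat the same two lines for each app. Here is a minimal sketch of the same idea as a loop, using small in-memory CSV strings in place of the files under data/. The sample rows and the sources dictionary are made up for illustration.

```python
import io

import pandas as pd

# Stand-ins for data/holla.csv and data/yubo.csv; in practice each value
# would be a file path passed straight to pd.read_csv.
sources = {
    "holla": (
        "Country,Date,Rating,Review,Version\n"
        "US,11/22/2019,width: 99%;,Great app,4.4.5\n"
    ),
    "yubo": (
        "Country,Date,Rating,Review,Version\n"
        "US,11/20/2019,width: 19%;,Too many ads,-\n"
        "US,11/19/2019,width: 79%;,Pretty good,4.1\n"
    ),
}

frames = []
for name, csv_text in sources.items():
    frame = pd.read_csv(io.StringIO(csv_text))
    frame["source"] = name  # record which app each review came from
    frames.append(frame)

# ignore_index=True renumbers the rows 0..n-1 across all apps
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
print(combined["source"].value_counts())
```

With a loop like this, adding a fifth app would only mean adding one more entry to the dictionary.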

We'll also add columns for racism, bullying, and unwanted sexual behavior. While we don't know which reviews contain this content yet, we'll use these columns to mark it in Excel or Google Sheets later.

# The project is about identifying App Store reviews that report
# unwanted sexual content, racism, and bullying. Start each label
# column empty (NaN) so it can be filled in by hand later.
df['racism'] = np.nan
df['bullying'] = np.nan
df['sexual'] = np.nan

df.head()
   Country  Date        Rating       Review                                             Version  source  racism  bullying  sexual
0  US       11/22/2019  width: 99%;  It’s a great app to meet new people and chat i...  4.4.5    holla   NaN     NaN       NaN
1  US       11/22/2019  width: 99%;  Holla is an excellent app, where I get to know...  4.4.5    holla   NaN     NaN       NaN
2  US       11/22/2019  width: 19%;  This app charges for everything now and is con...  -        holla   NaN     NaN       NaN
3  US       11/22/2019  width: 99%;  Free to use app, meet people around the world.     -        holla   NaN     NaN       NaN
4  US       11/21/2019  width: 99%;  I got this app and everything has been differe...  4.4.5    holla   NaN     NaN       NaN

Clean up the rating

Our ratings aren't numeric yet! The scraper saved the CSS width of the star bar (for example, "width: 79%;") instead of a number, so let's convert those percentages to star ratings from 1 to 5.

df.Rating.value_counts()
width: 99%;    32761
width: 19%;     8807
width: 79%;     6418
width: 59%;     4885
width: 39%;     3185
Name: Rating, dtype: int64
df.Rating = df.Rating.replace({
    'width: 99%;': 5,
    'width: 79%;': 4,
    'width: 59%;': 3,
    'width: 39%;': 2,
    'width: 19%;': 1
})
df.head()
   Country  Date        Rating  Review                                             Version  source
0  US       11/22/2019  5       It’s a great app to meet new people and chat i...  4.4.5    holla
1  US       11/22/2019  5       Holla is an excellent app, where I get to know...  4.4.5    holla
2  US       11/22/2019  1       This app charges for everything now and is con...  -        holla
3  US       11/22/2019  5       Free to use app, meet people around the world.     -        holla
4  US       11/21/2019  5       I got this app and everything has been differe...  4.4.5    holla
df.Rating.value_counts()
5    32761
1     8807
4     6418
3     4885
2     3185
Name: Rating, dtype: int64
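
The mapping above lists each observed width by hand. As an alternative sketch, the star count can be computed from the percentage, assuming each star corresponds to roughly 20% of the bar width. That assumption is inferred from the five values seen here, not from anything the site documents, and width_to_stars is a hypothetical helper name.

```python
import math
import re


def width_to_stars(style):
    """Convert a style string such as 'width: 79%;' to a star rating."""
    match = re.search(r"width:\s*(\d+(?:\.\d+)?)%", style)
    if match is None:
        raise ValueError(f"unrecognized rating style: {style!r}")
    percent = float(match.group(1))
    # 19% -> 1, 39% -> 2, ..., 99% -> 5: round up to the next whole star
    return math.ceil(percent / 20)


for style in ["width: 19%;", "width: 39%;", "width: 59%;", "width: 79%;", "width: 99%;"]:
    print(style, "->", width_to_stars(style))
```

Used on the raw column in place of the replace call, this would look like df.Rating.map(width_to_stars). Unlike the dictionary, it raises an error on an unexpected value instead of silently leaving the string in place.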
df.to_csv("data/reviews.csv", index=False)

Review

Instead of asking Apple or finding a secret API like the Washington Post did, we used an app marketing site to find App Store reviews of the apps we were interested in. The site didn't have a download button, though, so we wrote a simple scraper to pull the reviews down.

After obtaining the reviews, we cleaned them up a bit, combined them into one spreadsheet, and added columns for racism, bullying, and unwanted sexual behavior that we'll fill in manually later.

Discussion topics

Is pulling data from a secondary source okay?

How do we know that the site we obtained the reviews from lists all of the available reviews?

Do we need all of the reviews, or could we have filtered them at this point to narrow our field down?