Scraping app store reviews#

In the Washington Post's project, reporters found a "secret API" that let them download every App Store review for their target "random chat apps." We're going to download reviews from the marketing platform Sensor Tower instead. Our target apps will be Chat for Strangers, Yubo, Holla, and Skout.

Sensor Tower's reviews section doesn't have a download button, so we'll use a Selenium web scraper to collect the reviews instead.


from bs4 import BeautifulSoup
import pandas as pd
import time
import numpy as np

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://sensortower.com/ios/US/twelve-app/app/yubo-make-new-friends/1038653883/review-history?selected_tab=reviews')

Select your options and scrape#

After you log in, select the following options so that you only scrape US-based reviews. This is mostly to keep everything in English, since we won't be able to manually identify racism and similar content in non-English reviews.

Date: All time
Country: US

def get_page():
    # Name the parser explicitly so BeautifulSoup doesn't have to guess (and warn)
    doc = BeautifulSoup(driver.page_source, "html.parser")
    rows = doc.select("tbody tr")

    datapoints = []
    for row in rows:
        cells = row.select("td")
        data = {
            'Country': cells[0].text.strip(),
            'Date': cells[1].text.strip(),
            'Rating': cells[2].select_one('.gold')['style'],
            'Review': cells[3].select_one('.break-wrap-review').text.strip(),
            'Version': cells[4].text.strip()
        }
        datapoints.append(data)
    return datapoints
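
Because `get_page` reads from a live browser session, it can be handy to check the row-parsing logic offline first. The following is a minimal sketch under stated assumptions: `parse_rows` is a hypothetical helper that takes an HTML string instead of `driver.page_source`, and `SAMPLE_HTML` is a made-up fragment that only imitates the table structure the selectors above expect. It is not a copy of Sensor Tower's real markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure get_page() expects:
# five cells per row, a .gold element carrying the star width,
# and a .break-wrap-review element holding the review text.
SAMPLE_HTML = """
<table><tbody>
<tr>
  <td> US </td>
  <td>11/03/2019</td>
  <td><div class="stars"><div class="gold" style="width: 99%;"></div></div></td>
  <td><div class="break-wrap-review"> So much fun </div></td>
  <td>4.3.9</td>
</tr>
</tbody></table>
"""

def parse_rows(html):
    """Hypothetical offline twin of get_page(): parse review rows from an HTML string."""
    doc = BeautifulSoup(html, "html.parser")
    datapoints = []
    for row in doc.select("tbody tr"):
        cells = row.select("td")
        datapoints.append({
            'Country': cells[0].text.strip(),
            'Date': cells[1].text.strip(),
            'Rating': cells[2].select_one('.gold')['style'],
            'Review': cells[3].select_one('.break-wrap-review').text.strip(),
            'Version': cells[4].text.strip(),
        })
    return datapoints

rows = parse_rows(SAMPLE_HTML)
print(rows)
```

If the site's markup changes, a saved copy of the page can be passed to a helper like this to see which selector broke without logging in again.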

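all_data = []
wait = WebDriverWait(driver, 5, poll_frequency=0.05)
while True:
    # Wait for the loading overlay to disappear before reading the table
    wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, '.ajax-loading-cover')))

    results = get_page()
    all_data.extend(results)

    # The second pagination button is "next"; it is disabled on the last page.
    # (find_elements_by_css_selector was removed in Selenium 4)
    next_button = driver.find_elements(By.CSS_SELECTOR, ".btn-group .pagination")[1]
    if next_button.get_attribute('disabled'):
        break
    next_button.click()
    # Pause briefly so the overlay has time to appear before the next wait
    time.sleep(0.5)
    # Waiting for the overlay to become visible doesn't trigger fast enough!
    # wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.ajax-loading-cover')))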

df = pd.DataFrame(all_data)
df

	Country	Date	Rating	Review	Version
0	US	11/19/2019	width: 19%;	This is an Omegle knockoff. Don’t recommend. 9...	-
1	US	11/03/2019	width: 99%;	So much fun	4.3.9
2	US	10/31/2019	width: 19%;	No woman	4.3.9
3	US	10/31/2019	width: 79%;	My camera is still not working	4.3.9
4	US	10/25/2019	width: 19%;	Cam broke with new iOS update just green lines	4.3.8
...	...	...	...	...	...
3179	US	07/19/2011	width: 99%;	Fun app glad I got it for free, would be aweso...	1.0
3180	US	07/19/2011	width: 39%;	Love this on iPad, but I'm trying to download ...	-
3181	US	07/18/2011	width: 59%;	Great but drops convo all tge time :(	-
3182	US	07/18/2011	width: 99%;	Works just like the service it connects to.	-
3183	US	07/16/2011	width: 19%;	This app is a waste of money. Connect randomly...	-

3184 rows × 5 columns

# You'll change this filename for each app you're storing reviews for
df.to_csv("data/chat-for-strangers.csv", index=False)

Combine and add columns#

Once we've saved reviews for several different apps, we're ready to go. We'll combine them all into one single file and add a note about what app each review came from.

holla = pd.read_csv('data/holla.csv')
holla['source'] = 'holla'

yubo = pd.read_csv('data/yubo.csv')
yubo['source'] = 'yubo'

skout = pd.read_csv('data/skout.csv')
skout['source'] = 'skout'

strangers = pd.read_csv('data/chat-for-strangers.csv')
strangers['source'] = 'chat-for-strangers'

df = pd.concat([holla, yubo, skout, strangers], ignore_index=True)
df.shape

(56056, 6)

df.source.value_counts()

skout                 37484
holla                 10467
yubo                   4921
chat-for-strangers     3184
Name: source, dtype: int64
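
The read-and-tag steps above repeat once per app, so they can also be written as a loop. This is a minimal sketch, not the notebook's own method: `combine_reviews` is a hypothetical helper, and the tiny CSV files it reads are made up so that the example runs without the scraped data. It assumes each file is named after its app, as in the `data/` folder above.

```python
import os
import tempfile

import pandas as pd

def combine_reviews(data_dir, app_names):
    """Read <data_dir>/<app>.csv for each app and tag its rows with a 'source' column."""
    frames = []
    for name in app_names:
        frame = pd.read_csv(os.path.join(data_dir, f"{name}.csv"))
        frame['source'] = name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Made-up stand-ins for the real scraped files
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'Review': ['fun', 'ok']}).to_csv(os.path.join(tmp, 'holla.csv'), index=False)
    pd.DataFrame({'Review': ['meh']}).to_csv(os.path.join(tmp, 'yubo.csv'), index=False)
    combined = combine_reviews(tmp, ['holla', 'yubo'])

print(combined.source.value_counts())
```

With the real files, the call would look like `combine_reviews('data', ['holla', 'yubo', 'skout', 'chat-for-strangers'])`.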

We'll also add columns for racism, bullying, and unwanted sexual behavior. While we don't know which reviews contain this content yet, we'll use these columns to mark it in Excel or Google Sheets later.

# Using a machine learning algorithm to identify App Store reviews
# containing reports of unwanted sexual content, racism and bullying...
df['racism'] = np.nan
df['bullying'] = np.nan
df['sexual'] = np.nan

df.head()

	Country	Date	Rating	Review	Version	source	racism	bullying	sexual
0	US	11/22/2019	width: 99%;	It’s a great app to meet new people and chat i...	4.4.5	holla	NaN	NaN	NaN
1	US	11/22/2019	width: 99%;	Holla is an excellent app, where I get to know...	4.4.5	holla	NaN	NaN	NaN
2	US	11/22/2019	width: 19%;	This app charges for everything now and is con...	-	holla	NaN	NaN	NaN
3	US	11/22/2019	width: 99%;	Free to use app, meet people around the world.	-	holla	NaN	NaN	NaN
4	US	11/21/2019	width: 99%;	I got this app and everything has been differe...	4.4.5	holla	NaN	NaN	NaN

Clean up the rating#

Our ratings aren't numeric yet! The scraper saved each rating as the CSS width of the gold star bar (for example, `width: 99%;` for five stars), so let's convert those strings to actual numbers.

df.Rating.value_counts()

width: 99%;    32761
width: 19%;     8807
width: 79%;     6418
width: 59%;     4885
width: 39%;     3185
Name: Rating, dtype: int64

df.Rating = df.Rating.replace({
    'width: 99%;': 5,
    'width: 79%;': 4,
    'width: 59%;': 3,
    'width: 39%;': 2,
    'width: 19%;': 1
})
df.head()

	Country	Date	Rating	Review	Version	source
0	US	11/22/2019	5	It’s a great app to meet new people and chat i...	4.4.5	holla
1	US	11/22/2019	5	Holla is an excellent app, where I get to know...	4.4.5	holla
2	US	11/22/2019	1	This app charges for everything now and is con...	-	holla
3	US	11/22/2019	5	Free to use app, meet people around the world.	-	holla
4	US	11/21/2019	5	I got this app and everything has been differe...	4.4.5	holla

df.Rating.value_counts()

5    32761
1     8807
4     6418
3     4885
2     3185
Name: Rating, dtype: int64
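
The mapping above hard-codes the five width strings that appear in this data set. As a hedged alternative, the percentage can be parsed out of the style string and divided into stars. This sketch assumes each star covers 20% of the bar, which matches the five observed values but is not something Sensor Tower documents; `width_to_stars` is a hypothetical helper name.

```python
import re

def width_to_stars(style):
    """Convert a CSS style such as 'width: 99%;' into a 1-5 star rating.

    Assumes each star covers 20% of the bar (an assumption, not a documented rule).
    """
    match = re.search(r'width:\s*(\d+(?:\.\d+)?)%', style)
    if match is None:
        raise ValueError(f"Unrecognized rating style: {style!r}")
    percent = float(match.group(1))
    # The observed widths sit 1% short of each 20% step, so round to the nearest star
    return round(percent / 20)

for style in ['width: 19%;', 'width: 39%;', 'width: 59%;', 'width: 79%;', 'width: 99%;']:
    print(style, '->', width_to_stars(style))
```

On the still-unconverted column this could be applied with `df.Rating.map(width_to_stars)`. Unlike a fixed dictionary, it raises an error on a string it cannot parse instead of silently leaving it in place.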

df.to_csv("data/reviews.csv", index=False)

Review#

Instead of asking Apple or finding a secret API like the Washington Post did, we used an app marketing site, Sensor Tower, to find App Store reviews of the apps we were interested in. The site didn't have a download button, though, so we wrote a simple scraper to pull the reviews down.

After obtaining the reviews, we combined them into one spreadsheet, cleaned up the ratings, and added columns for racism, bullying, and unwanted sexual behavior that we'll fill in manually later.

Discussion topics#

Is pulling data from a secondary source okay?

How do we know that they list all available reviews on the site that we obtained the reviews from?

Do we need all of the reviews, or could we have filtered them at this point to narrow our field down?
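
To make the last question concrete, here is one way a keyword filter might look. It is only a sketch for discussion: the `KEYWORDS` list and the sample reviews are made up, and a filter like this would also discard any review that describes a problem in words we didn't think to list.

```python
import re

import pandas as pd

# Hypothetical keyword list, for illustration only
KEYWORDS = ['racist', 'bully', 'harass']

# Made-up reviews standing in for the scraped data
reviews = pd.DataFrame({
    'Review': [
        'So much fun',
        'Full of racist comments',
        'People bully you on here',
        None,
    ]
})

# Build one case-insensitive pattern; na=False treats missing reviews as non-matches
pattern = '|'.join(re.escape(word) for word in KEYWORDS)
mask = reviews['Review'].str.contains(pattern, case=False, na=False)
candidates = reviews[mask]

print(len(candidates), "of", len(reviews), "reviews match")
```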
