Apps have changed how we interact with the world. Shopping, music, news, and dating are just a few of the things you can do with them. If you can think of it, there's probably an app for it. Some apps are better than others. You can learn what people like and dislike about an app by analyzing the language of its user reviews. Sentiment analysis and topic modeling are two areas of Natural Language Processing (NLP) that can help with this, but not if you don't have any reviews to examine!
Before we get ahead of ourselves, you need to scrape and store some reviews. This blog will show you how to do just that with Python code and the google-play-scraper and PyMongo packages. You have several options for storing or saving your scraped reviews; here we use MongoDB.
google-play-scraper provides APIs for crawling the Google Play Store in real time. It can be used to obtain:
- App information, including the app's title and description, as well as its price, genre, and current version.
- App reviews.
You can use the app function to retrieve app information, and the reviews or reviews_all functions to get reviews. We will briefly go through how to use app before concentrating on how to get the most out of reviews. While reviews_all is convenient in some situations, we prefer working with reviews. Once we get there, we will explain why and how with plenty of code.
Getting Started with google-play-scraper
Step 1: Obtain App IDs
To scrape each app, you'll need one piece of information: the app's ID. You can find it in the URL of the app's page on the Google Play Store. The part you need comes just after "id=", as illustrated in the image below.
In some cases, the URL ends with the app ID. In other cases, further parameters follow it, and you only need the part between "id=" and "&".
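If you would rather not copy the ID out by hand, the rule above can be sketched with Python's standard library. This is a minimal illustration, not part of google-play-scraper: the helper name extract_app_id is made up here, and the example URLs are assumed to follow the usual "details?id=..." pattern.

```python
from urllib.parse import urlparse, parse_qs


def extract_app_id(play_store_url):
    """Return the value of the "id" query parameter, or None if it is absent.

    Hypothetical helper for illustration; it is not part of google-play-scraper.
    """
    query = urlparse(play_store_url).query
    values = parse_qs(query).get("id")
    return values[0] if values else None


# URL that ends with the app ID
print(extract_app_id("https://play.google.com/store/apps/details?id=co.thefabulous.app"))

# URL with extra parameters after the ID; only the part before "&" is returned
print(extract_app_id("https://play.google.com/store/apps/details?id=com.aurahealth&hl=en&gl=US"))
```

Parsing the query string instead of splitting on "id=" handles both URL shapes with the same code.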
Our most recent project involves a collection of apps for mental health, mindfulness, and self-care. We kept track of a lot of different information in a spreadsheet while exploring the apps, so it seemed like a reasonable place to keep each app's ID as well.
Step 2: Installing and Importing
Here, we will import everything mentioned earlier, including PyMongo. You will also need to install MongoDB first; a guide for installing the Community Edition is available in the MongoDB documentation.
Install any of the following packages with pip as needed so that you can import them:
import pandas as pd

# for scraping app info and reviews from Google Play
from google_play_scraper import app, Sort, reviews

# for pretty printing data structures
from pprint import pprint

# for storing in MongoDB
import pymongo
from pymongo import MongoClient

# for keeping track of timing
import datetime as dt
from tzlocal import get_localzone

# for building in wait times
import random
import time
With MongoDB installed, you will establish a new database for the project and add new collections (essentially the MongoDB equivalent of tables in relational databases). We will have one collection for app information and another for app reviews.
## Set up Mongo client
client = MongoClient(host='localhost', port=27017)

## Database for project
app_proj_db = client['app_proj_db']

## Set up new collection within project db for app info
info_collection = app_proj_db['info_collection']

## Set up new collection within project db for app reviews
review_collection = app_proj_db['review_collection']
MongoDB creates databases and collections lazily. This means that until we start inserting documents (MongoDB's equivalent of rows in relational database tables) into our collections, neither the database nor the collections will actually exist.
Scraping App Data
The setup is now ready for scraping. All we need is a list of app IDs. You can download a CSV copy of the spreadsheet and read it into a pandas DataFrame.
## Read in file containing app names and IDs
app_df = pd.read_csv('Data/app_ids.csv')
app_df.head()
We can simply fetch the list of the app names and IDs to loop through during scraping:
## Get list of app names and app IDs
app_names = list(app_df['app_name'])
app_ids = list(app_df['android_appID'])
We'll use the app names later, when we scrape reviews. To scrape general app information with the app function, all we need are the app IDs. The code below loops over each app, scraping its information from the Google Play Store and saving it in a list.
## Loop through app IDs to get app info
app_info = []
for i in app_ids:
    info = app(i)
    del info['comments']
    app_info.append(info)

## Pretty print the data for the first app
pprint(app_info[0])
The last line prints a dictionary with various details about our first app. The following is a shortened version of that output:
{'adSupported': None,
 'androidVersion': '4.1',
 'androidVersionText': '4.1 and up',
 'appId': 'com.aurahealth',
 'containsAds': False,
 'contentRating': 'Everyone',
 'contentRatingDescription': None,
 'currency': 'USD',
 'description': '<b>Find peace everyday with Aura</b> - discover thousands of ' ... (truncated),
 'descriptionHTML': '<b>Find peace everyday with Aura</b> - discover ' ... (truncated),
 'developer': 'Aura Health - Mindfulness, Sleep, Meditations',
 'developerAddress': '2 Embarcadero Center, Fl 8\nSan Francisco, CA 94111',
 'developerEmail': 'hello@aurahealth.io',
 'developerId': 'Aura+Health+-+Mindfulness,+Sleep,+Meditations',
 'developerInternalID': '8194778368040078712',
 'developerWebsite': 'http://www.aurahealth.io',
 'editorsChoice': False,
 'free': True,
 'genre': 'Health & Fitness',
 ... }
Let's use PyMongo's insert_many method to safely save the app details in our info collection. insert_many expects a list of dictionaries, which is exactly what we've just created.
## Insert app details into info_collection
info_collection.insert_many(app_info)
You can query that dataset straight to a DataFrame with a single line of code whenever you want to start working with it!
## Query the collection and create DataFrame from the list of dicts
info_df = pd.DataFrame(list(info_collection.find({})))
info_df.head()
Scraping App Reviews
We prefer working with the reviews function instead of reviews_all. Here are the reasons:
- If you truly want all the reviews, you can still acquire them.
- Instead of doing everything for a single app all at once, you can break the process for each app into batches. This is beneficial because it gives you options. You can do the following:
  - Get regular updates on how many reviews you've scraped.
  - Save scraped data as you go instead of waiting until the end.
Anatomy of the “Reviews” Function
The reviews function returns two values. The first is the review data we're looking for. The second is a continuation token, which we need if we want to scrape more than count reviews.
rvws, token = reviews(
    'co.thefabulous.app',   # app's ID, found in app's url
    lang='en',              # defaults to 'en'
    country='us',           # defaults to 'us'
    sort=Sort.NEWEST,       # defaults to Sort.MOST_RELEVANT
    filter_score_with=5,    # defaults to None (get all scores)
    count=100               # defaults to 100
    # , continuation_token=token
)
The app ID is the first argument you need to pass to reviews. Reviews can be sorted in one of two ways: by most recent or by whatever Google Play considers most relevant. You can also filter reviews by their score.
The count parameter's main purpose is to tell the function how many reviews it should retrieve before ending. The following is taken from the google-play-scraper documentation:
"An excessively high count can pose complications. Because Google Play supports a limit of 200 reviews per page, it is designed to paginate and recrawl by 200 until the number of results reaches count."
As a side note, using reviews_all is roughly equivalent to calling reviews with count set to infinity, which seems excessive to me.
In my opinion, it is better to think of count, and use it, as a batch size. Simply set count to 200, get the reviews back along with your token, and pass that token to the next call of the reviews function.
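The batch-and-token pattern can be sketched without touching the network. In the example below, fake_reviews is a made-up stand-in for the reviews function: it returns a (batch, token) pair like the real one, but its token is a plain integer offset, whereas the library's token should be treated as an opaque object. The stub also signals the end with an empty batch, which is an assumption of this sketch; the loop built later in this post compares review IDs instead.

```python
def fake_reviews(app_id, count=200, continuation_token=None):
    """Hypothetical stand-in for reviews(): returns (batch, token).

    Pretends the app has 450 reviews and uses an integer offset as the token.
    """
    total = 450
    start = continuation_token or 0
    end = min(start + count, total)
    batch = [{"reviewId": f"{app_id}-{i}"} for i in range(start, end)]
    return batch, end


def scrape_in_batches(fetch, app_id, batch_size=200, max_batches=10):
    """Call fetch repeatedly, feeding each returned token into the next call."""
    all_reviews = []
    token = None
    for _ in range(max_batches):
        batch, token = fetch(app_id, count=batch_size, continuation_token=token)
        if not batch:  # assumption: the stub returns an empty batch when done
            break
        all_reviews.extend(batch)
    return all_reviews


result = scrape_in_batches(fake_reviews, "co.thefabulous.app")
print(len(result))
```

Passing the fetch function in as an argument keeps the loop testable; with the real library you would pass reviews itself and keep the wait between calls.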
Review Scraping
Let us break down the code. It:
- Scrapes Google Play reviews by iterating through a list of app IDs.
- Stores the reviews in a MongoDB collection on a regular basis.
- Prints progress updates about the scraping operation.
Step 1: Setting up The Loop
We previously saved our lists of app names and IDs. The app names list isn't strictly required for scraping, but we use it to label each review and to make the progress output readable. In this block of code, we start the for loop that goes through all of our apps. Just double-check that your lists of names and IDs are the same length and in the same order.
## Loop through apps to get reviews
for app_name, app_id in zip(app_names, app_ids):

    # Get start time
    start = dt.datetime.now(tz=get_localzone())
    fmt = "%m/%d/%y - %T %p"

    # Print starting output for app
    print('---'*20)
    print('---'*20)
    print(f'***** {app_name} started at {start.strftime(fmt)}')
    print()

    # Empty list for storing reviews
    app_reviews = []

    # Number of reviews to scrape per batch
    count = 200

    # To keep track of how many batches have been completed
    batch_num = 0
Step 2: Scrape First Batch of Reviews
You will need to add two keys to every newly obtained review dictionary: the app's name and the app's ID. Because the data gathered for each review does not explicitly identify which app the review was for, attaching these identifiers now saves trouble later.
    # Retrieve reviews (and continuation_token) with reviews function
    rvws, token = reviews(
        app_id,            # found in app's url
        lang='en',         # defaults to 'en'
        country='us',      # defaults to 'us'
        sort=Sort.NEWEST,  # start with most recent
        count=count        # batch size
    )

    # For each review obtained
    for r in rvws:
        r['app_name'] = app_name  # add key for app's name
        r['app_id'] = app_id      # add key for app's id

    # Add the list of review dicts to overall list
    app_reviews.extend(rvws)

    # Increase batch count by one
    batch_num += 1
    print(f'Batch {batch_num} completed.')

    # Wait 1 to 5 seconds to start next batch
    time.sleep(random.randint(1, 5))
Step 3: Store the Review IDs from the Very First Batch
Each review has a unique ID. We need to save these before gathering our next batch of reviews so that we can compare them later.
    # Append review IDs to list prior to starting next batch
    pre_review_ids = []
    for rvw in app_reviews:
        pre_review_ids.append(rvw['reviewId'])
Step 4: Set and Loop through A Maximum Number of Batches
Since we received a token with the first batch of reviews, we can use it to loop through further batches of 200 reviews each. In the code below, we cap the number of batches at 5,000 by using range(4999) (we already got our first batch). This means we'll get up to one million reviews per app (5,000 batches of 200), assuming there are that many.
    # Loop through at most max number of batches
    for batch in range(4999):
        rvws, token = reviews(  # store continuation_token
            app_id,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            count=count,
            # using token obtained from previous batch
            continuation_token=token
        )

        # Append unique review IDs from current batch to new list
        new_review_ids = []
        for r in rvws:
            new_review_ids.append(r['reviewId'])

            # And add keys for name and id to ea review dict
            r['app_name'] = app_name  # add key for app's name
            r['app_id'] = app_id      # add key for app's id

        # Add the list of review dicts to main app_reviews list
        app_reviews.extend(rvws)

        # Increase batch count by one
        batch_num += 1
Step 5: Break the Loop if Nothing is Added
You need to compare the set of review IDs you had before scraping the current batch with the set of review IDs you have after adding the current batch. If the two sets are the same size, the latest batch added no new reviews. In that case, you break out of the loop and move on to the next app.
        # Break loop and stop scraping for current app if most recent batch
        # did not add any unique reviews
        all_review_ids = pre_review_ids + new_review_ids
        if len(set(pre_review_ids)) == len(set(all_review_ids)):
            print(f'No reviews left to scrape. Completed {batch_num} batches.\n')
            break

        # all_review_ids becomes pre_review_ids to check against
        # for next batch
        pre_review_ids = all_review_ids
If the sizes differ, scraping continues. Before starting the next batch, the current list of all review IDs is reassigned to the pre_review_ids variable.
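The same stop rule can also be expressed with a single running set, which avoids rebuilding sets from ever-growing lists on each batch. This is a variation for illustration rather than the code used in this post, and the helper name add_new_ids is made up.

```python
def add_new_ids(seen_ids, batch_ids):
    """Add batch_ids to the set seen_ids in place and return how many were new.

    Hypothetical helper: a return value of 0 corresponds to the "same length"
    condition that breaks the scraping loop.
    """
    before = len(seen_ids)
    seen_ids.update(batch_ids)
    return len(seen_ids) - before


seen = {"a", "b", "c"}               # IDs collected from earlier batches
print(add_new_ids(seen, ["c", "d"]))  # one unseen ID, so keep scraping
print(add_new_ids(seen, ["a", "d"]))  # nothing new, so the loop would break
```

Inside the batch loop, the break condition would then read "if add_new_ids(seen, new_review_ids) == 0".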
Step 6: Save the Data and Print an Update After every ith Batch
It's nice to get an update on how things are going when you're scraping tens of thousands or even millions of reviews. Perhaps more importantly, it's comforting to know that your data is being saved safely as you go. The following code does both every 100 batches.
        # At every 100th batch
        if batch_num%100==0:

            # print update on number of batches
            print(f'Batch {batch_num} completed.')

            # insert reviews into collection
            review_collection.insert_many(app_reviews)

            # print update about num of reviews inserted
            store_time = dt.datetime.now(tz=get_localzone())
            print(f"""
            Successfully inserted {len(app_reviews)} {app_name}
            reviews into collection at {store_time.strftime(fmt)}.\n
            """)

            # empty our list for next round of 100 batches
            app_reviews = []

        # Wait 1 to 5 seconds to start next batch
        time.sleep(random.randint(1,5))
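One thing to keep in mind with periodic saving is that the loop can end between checkpoints, leaving reviews in the buffer that were never inserted. The sketch below shows the checkpoint pattern together with a final flush. ListCollection is a made-up in-memory stand-in for a PyMongo collection so that the example runs without a database; it rejects an empty list because, as far as I know, PyMongo's insert_many does the same. store_batches is likewise a hypothetical helper.

```python
class ListCollection:
    """Hypothetical in-memory stand-in for a PyMongo collection."""

    def __init__(self):
        self.docs = []

    def insert_many(self, documents):
        # Assumption: mirrors PyMongo, which is understood to reject empty lists
        if not documents:
            raise TypeError("documents must be a non-empty list")
        self.docs.extend(documents)


def store_batches(batches, collection, flush_every=100):
    """Buffer batches and insert them every flush_every batches, plus once at the end."""
    buffer = []
    for batch_num, batch in enumerate(batches, start=1):
        buffer.extend(batch)
        if batch_num % flush_every == 0:
            collection.insert_many(buffer)
            buffer = []  # start a fresh buffer for the next round
    if buffer:  # final flush for anything left after the last checkpoint
        collection.insert_many(buffer)


collection = ListCollection()
# Five small batches of two fake reviews each
batches = [[{"reviewId": f"{b}-{i}"} for i in range(2)] for b in range(5)]
store_batches(batches, collection, flush_every=2)
print(len(collection.docs))
```

The "if buffer" guard matters in both directions: it saves the leftover reviews and avoids calling insert_many with nothing to insert.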