How Web Scraping Is Used To Scrape Google Play Store Data?

Apps have changed the way we interact with the world. Shopping, music, news, and dating are just a few of the things you can do with them. If you can think of it, there's probably an app for it. Some apps are better than others. You can learn what people like and dislike about an app by analyzing the language of its user reviews. Sentiment Analysis and Topic Modeling are two domains of Natural Language Processing (NLP) that can help with this, but not if you don't have any reviews to examine!

Before we get ahead of ourselves, you need to scrape and store some reviews. This blog will show you how to do just that with Python code and the google-play-scraper and PyMongo packages. You have several options for storing or saving your scraped reviews; here we use MongoDB.

The google-play-scraper package provides APIs for crawling the Google Play Store in real time. It can be used to obtain:

  • App information, such as the app's title and description, price, genre, and current version.
  • App reviews.

You can use the app function to retrieve app information, and the reviews or reviews_all functions to get reviews. We will go through how to use app briefly before concentrating on how to get the most out of reviews. While reviews_all is convenient in some situations, we prefer working with reviews. Once we get there, we will explain why and how with plenty of code.

Getting Started with google-play-scraper

Step 1: Obtain App IDs

To scrape each app, you'll need one piece of information: the app's ID. You can find it in the URL of the app's page on the Google Play Store. The part you need comes just after "id=". For example, in https://play.google.com/store/apps/details?id=com.aurahealth the app ID is com.aurahealth.

In many cases, the URL ends with the app ID. When other parameters follow it, you only need the part between "id=" and "&".
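If you have many URLs, pulling the ID out by hand gets tedious. The following is a minimal sketch using only the standard library; the helper name extract_app_id is our own and not part of google-play-scraper.

```python
from urllib.parse import urlparse, parse_qs


def extract_app_id(play_store_url):
    """Return the value of the "id" query parameter from a Google Play URL."""
    query = parse_qs(urlparse(play_store_url).query)
    ids = query.get("id")
    if not ids:
        raise ValueError(f"No 'id' parameter found in {play_store_url!r}")
    return ids[0]


# The result is the same whether or not other parameters follow the ID
print(extract_app_id("https://play.google.com/store/apps/details?id=com.aurahealth"))
print(extract_app_id("https://play.google.com/store/apps/details?id=com.aurahealth&hl=en"))
```

Parsing the query string instead of slicing the text means the position of "id=" within the URL does not matter.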

The project here uses a collection of apps for mental health, mindfulness, and self-care. While exploring apps, we kept track of a lot of different information in a spreadsheet, which seemed like a reasonable place to keep each app's ID as well.

Step 2: Installing and Importing

Here we import everything used in this walkthrough, including PyMongo. You will also need to install MongoDB first; MongoDB publishes an installation guide for the Community Edition.

To be able to import each of the following, install any missing packages with pip:

import pandas as pd
# for scraping app info and reviews from Google Play
from google_play_scraper import app, Sort, reviews

# for pretty printing data structures
from pprint import pprint

# for storing in MongoDB
import pymongo
from pymongo import MongoClient

# for keeping track of timing
import datetime as dt
from tzlocal import get_localzone

# for building in wait times
import random
import time
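Before running the imports, it can help to see which third-party packages are missing. This is a small sketch rather than a required step; the REQUIRED mapping (import name to pip name) and the helper missing_packages are our own illustration.

```python
import importlib.util

# Import name -> name to pass to "pip install" (an assumption based on the
# packages imported above; adjust it to your own environment)
REQUIRED = {
    "pandas": "pandas",
    "google_play_scraper": "google-play-scraper",
    "pymongo": "pymongo",
    "tzlocal": "tzlocal",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot currently be imported."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```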

With MongoDB installed, connect to it, create a new database for the project, and add new collections (essentially the MongoDB equivalent of tables in a relational database). We use one collection for app information and another for app reviews.

## Set up Mongo client
client = MongoClient(host='localhost', port=27017)

## Database for project
app_proj_db = client['app_proj_db']

## Set up new collection within project db for app info
info_collection = app_proj_db['info_collection']

## Set up new collection within project db for app reviews
review_collection = app_proj_db['review_collection']

MongoDB creates databases and collections lazily. This means that none of these will actually exist until we start inserting documents (MongoDB's equivalent of rows in relational database tables) into our collections.
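To see the lazy creation for yourself, you can ask the database which collections exist before and after the first insert. The helper below is a sketch of our own; it relies only on the list_collection_names method of a PyMongo database object.

```python
def collections_created(db, names):
    """Map each collection name to True if it already exists in the database."""
    existing = set(db.list_collection_names())
    return {name: name in existing for name in names}


# With a running MongoDB server and the app_proj_db object defined above:
# collections_created(app_proj_db, ['info_collection', 'review_collection'])
# Both values should be False until the first document is inserted.
```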

Scraping App Data

Everything is now ready for scraping. All we need is the list of app IDs. You can download a CSV copy of the spreadsheet and read it into a pandas DataFrame.

## Read in file containing app names and IDs
app_df = pd.read_csv('Data/app_ids.csv')
app_df.head()

From the DataFrame, we can pull out the lists of app names and IDs to loop through during scraping:

## Get list of app names and app IDs
app_names = list(app_df['app_name'])
app_ids = list(app_df['android_appID'])

We'll use the app names when we scrape reviews. To scrape general app information with the app function, all we need is the app IDs. The code below loops over each app, scrapes its information from the Google Play Store, and saves it in a list.

## Loop through app IDs to get app info
app_info = []
for i in app_ids:
    info = app(i)
    # Drop the sample comments if present; pop avoids a KeyError
    # when the key is missing
    info.pop('comments', None)
    app_info.append(info)


## Pretty print the data for the first app
pprint(app_info[0])

The last line prints a dictionary with various details about our first app. The following is a shortened version of that output:

{'adSupported': None,
'androidVersion': '4.1',
'androidVersionText': '4.1 and up',
'appId': 'com.aurahealth',
'containsAds': False,
'contentRating': 'Everyone',
'contentRatingDescription': None,
'currency': 'USD',
'description': '<b>Find peace everyday with Aura</b> - discover thousands of ' ... (truncated),
'descriptionHTML': '<b>Find peace everyday with Aura</b> - discover ' ... (truncated),
'developer': 'Aura Health - Mindfulness, Sleep, Meditations',
'developerAddress': '2 Embarcadero Center, Fl 8\nSan Francisco, CA 94111',
'developerEmail': 'hello@aurahealth.io',
'developerId': 'Aura+Health+-+Mindfulness,+Sleep,+Meditations',
'developerInternalID': '8194778368040078712',
'developerWebsite': 'http://www.aurahealth.io',
'editorsChoice': False,
'free': True,
'genre': 'Health & Fitness',
...
}

Let's use PyMongo's insert_many method to save the app details in our info collection. insert_many expects a list of dictionaries, which is exactly what we've just created.

## Insert app details into info_collection
info_collection.insert_many(app_info)

Whenever you want to start working with the data, you can query the collection straight into a DataFrame with a single line of code!

## Query the collection and create DataFrame from the list of dicts
info_df = pd.DataFrame(list(info_collection.find({})))
info_df.head()

Scraping App Reviews

We prefer working with the reviews function instead of reviews_all. Here are the reasons:

  • If you truly want all the reviews, you can still get them.
  • Instead of doing everything for a single app all at once, you can break the process into batches for each app. This gives you options. You can:
      • Get regular updates on how many reviews you've scraped.
      • Save scraped data as you go instead of waiting until the end.

Anatomy of the “Reviews” Function

The reviews function returns two values. The first holds the review data we're looking for. The second is a continuation token, which we'll need if we wish to scrape more than count reviews.

rvws, token = reviews(
    'co.thefabulous.app', # app's ID, found in app's url
    lang='en',            # defaults to 'en'
    country='us',         # defaults to 'us'
    sort=Sort.NEWEST,     # defaults to Sort.MOST_RELEVANT
    filter_score_with=5,  # defaults to None (get all scores)
    count=100             # defaults to 100
    # , continuation_token=token
)

The app ID is the first argument you'll need to pass to reviews. Reviews can be sorted in one of two ways: by most recent or by whatever Google Play considers most relevant. You can also filter reviews by their score.

The count parameter's main purpose is to tell the function how many reviews it should retrieve before ending. The following is taken from the google-play-scraper documentation:

"An excessively high count can pose complications. Because Google Play supports a limit of 200 reviews per page, it is designed to paginate and recrawl by 200 until the number of results reaches count."

As a side note, setting count to infinity is essentially what reviews_all does, which seems excessive to me.

In my opinion, it is better to think of count as a batch size and use it that way. Simply set count to 200, get back the reviews along with your token, and pass that token to the next call of the reviews function.
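The token-passing pattern can be sketched independently of the network. In the example below, scrape_in_batches and fake_fetch are hypothetical names of our own: fake_fetch stands in for the reviews function so the sketch runs offline, and it returns (items, next_token) in the same shape.

```python
def scrape_in_batches(fetch_batch, max_batches):
    """Call fetch_batch(token) repeatedly, feeding each returned token back in.

    fetch_batch must return (items, next_token), like the reviews function.
    """
    collected = []
    token = None  # there is no token before the first batch
    for _ in range(max_batches):
        items, token = fetch_batch(token)
        if not items:  # nothing came back, so stop early
            break
        collected.extend(items)
    return collected


# Stand-in for the real API: 450 fake reviews served 200 at a time,
# with the next start offset acting as the continuation token.
FAKE_REVIEWS = [{'reviewId': str(i)} for i in range(450)]


def fake_fetch(token, batch_size=200):
    start = token or 0
    return FAKE_REVIEWS[start:start + batch_size], start + batch_size


print(len(scrape_in_batches(fake_fetch, max_batches=10)))  # prints 450
```

With the real library, fetch_batch could be a small wrapper such as lambda tok: reviews(app_id, sort=Sort.NEWEST, count=200, continuation_token=tok), assuming that passing continuation_token=None behaves the same as omitting it.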

Review Scraping

Let us break down the code, which does the following:

  • Scrapes Google Play reviews by iterating through a list of app IDs.
  • Stores the reviews in a MongoDB collection on a regular basis.
  • Prints progress updates about the scraping operation.

Step 1: Setting up The Loop

Earlier, we saved our lists of app names and IDs. The app names list isn't strictly required for scraping, but we use it to label each review and the progress messages. This block of code starts the for loop that goes through all of our apps. Just double-check that your lists of names and IDs are the same length and in the same order.

## Loop through apps to get reviews
for app_name, app_id in zip(app_names, app_ids):

    # Get start time
    start = dt.datetime.now(tz=get_localzone())
    fmt = "%m/%d/%y - %I:%M:%S %p"  # 12-hour clock with AM/PM

    # Print starting output for app
    print('---'*20)
    print('---'*20)
    print(f'***** {app_name} started at {start.strftime(fmt)}')
    print()

    # Empty list for storing reviews
    app_reviews = []

    # Number of reviews to scrape per batch
    count = 200

    # To keep track of how many batches have been completed
    batch_num = 0
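Because zip stops at the end of the shorter list without complaint, a mismatch between the two lists would silently skip apps. A quick check before starting the loop can catch that. This is a sketch; check_app_lists is a hypothetical helper, not part of the original code.

```python
from collections import Counter


def check_app_lists(app_names, app_ids):
    """Raise ValueError if the lists differ in length or an ID is repeated."""
    if len(app_names) != len(app_ids):
        raise ValueError(
            f'{len(app_names)} app names but {len(app_ids)} app IDs'
        )
    repeated = sorted(i for i, n in Counter(app_ids).items() if n > 1)
    if repeated:
        raise ValueError(f'Duplicate app IDs: {repeated}')
    return len(app_ids)


# Example with two apps that appear elsewhere in this post
print(check_app_lists(['Aura', 'Fabulous'],
                      ['com.aurahealth', 'co.thefabulous.app']))  # prints 2
```

Note that this check cannot tell whether the names and IDs are in the same order; that still needs a look at the spreadsheet.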

Step 2: Scrape First Batch of Reviews

You will add two keys to each newly obtained review dictionary: the app's name and the app's ID. The data gathered for each review does not explicitly identify which app the review was for, so attaching these identifiers is worthwhile. Potential crisis averted!

    # Retrieve reviews (and continuation_token) with reviews function
    rvws, token = reviews(
        app_id,           # found in app's url
        lang='en',        # defaults to 'en'
        country='us',     # defaults to 'us'
        sort=Sort.NEWEST, # start with most recent
        count=count       # batch size
    )

    # For each review obtained
    for r in rvws:
        r['app_name'] = app_name # add key for app's name
        r['app_id'] = app_id     # add key for app's id

    # Add the list of review dicts to overall list
    app_reviews.extend(rvws)

    # Increase batch count by one
    batch_num += 1
    print(f'Batch {batch_num} completed.')

    # Wait 1 to 5 seconds to start next batch
    time.sleep(random.randint(1,5))

Step 3: Store the Review IDs from the Very First Batch

Each review has a unique ID. We need to save these before gathering our next batch of reviews so that we can compare them later.

    # Append review IDs to list prior to starting next batch
    pre_review_ids = []
    for rvw in app_reviews:
        pre_review_ids.append(rvw['reviewId'])

Step 4: Set and Loop through A Maximum Number of Batches

Because we received a token with the first batch of reviews, we can now loop through further batches of 200 reviews each. The code below caps the total at 5,000 batches by using range(4999) (we already got our first batch). Since we sort by newest, this means we'll get up to the most recent million reviews, assuming there are that many.

    # Loop through at most max number of batches
    for batch in range(4999):
        rvws, token = reviews( # store continuation_token
            app_id,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            count=count,
            # using token obtained from previous batch
            continuation_token=token
        )

        # Append unique review IDs from current batch to new list
        new_review_ids = []
        for r in rvws:
            new_review_ids.append(r['reviewId'])
            # And add keys for name and id to each review dict
            r['app_name'] = app_name # add key for app's name
            r['app_id'] = app_id     # add key for app's id

        # Add the list of review dicts to main app_reviews list
        app_reviews.extend(rvws)

        # Increase batch count by one
        batch_num += 1
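The figure of 4,999 follows from simple arithmetic: one million reviews at 200 per batch is 5,000 batches, minus the one already scraped. If you want a different cap, a small sketch like the following (with helper names of our own) gives the value to pass to range.

```python
import math

BATCH_SIZE = 200  # the count used for every call to reviews


def batches_needed(target_reviews, batch_size=BATCH_SIZE):
    """Total number of batches needed to reach target_reviews."""
    return math.ceil(target_reviews / batch_size)


def extra_batches(target_reviews, batch_size=BATCH_SIZE):
    """Batches still to fetch after the first one has been scraped."""
    return max(batches_needed(target_reviews, batch_size) - 1, 0)


print(batches_needed(1_000_000))  # prints 5000
print(extra_batches(1_000_000))   # prints 4999
```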

Step 5: Break the Loop if Nothing is Added

Compare the set of review IDs you had before scraping the current batch with the set you have after adding the current batch. If the two sets are the same length, the batch added no new reviews. In that case, break out of the loop and move on to the next app.

        # Break loop and stop scraping for current app if most recent batch
        # did not add any unique reviews
        all_review_ids = pre_review_ids + new_review_ids
        if len(set(pre_review_ids)) == len(set(all_review_ids)):
            print(f'No reviews left to scrape. Completed {batch_num} batches.\n')
            break

        # all_review_ids becomes pre_review_ids to check against
        # for next batch
        pre_review_ids = all_review_ids

If the lengths differ, keep scraping. Before starting the next batch, reassign the current list of all review IDs to the pre_review_ids variable.
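The same check can be expressed with a single set that is updated in place, which avoids rebuilding two sets from ever-growing lists on every batch. This is an alternative sketch, not the code used above; count_new_reviews is a hypothetical helper.

```python
def count_new_reviews(seen_ids, batch):
    """Add the batch's review IDs to seen_ids and return how many were new."""
    batch_ids = {r['reviewId'] for r in batch}
    new_ids = batch_ids - seen_ids
    seen_ids.update(new_ids)
    return len(new_ids)


seen = set()
first_batch = [{'reviewId': 'a'}, {'reviewId': 'b'}]
repeat_batch = [{'reviewId': 'b'}, {'reviewId': 'a'}]

print(count_new_reviews(seen, first_batch))   # prints 2
print(count_new_reviews(seen, repeat_batch))  # prints 0
```

A return value of 0 corresponds to the condition that triggers the break in Step 5.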

Step 6: Save the Data and Print an Update After every ith Batch

It's nice to get an update on how things are going when you're scraping tens of thousands or even millions of reviews. Perhaps more importantly, it's comforting to know that your data is being safely saved as you go. The following code does both every 100 batches.

        # At every 100th batch
        if batch_num % 100 == 0:

            # print update on number of batches
            print(f'Batch {batch_num} completed.')

            # insert reviews into collection
            review_collection.insert_many(app_reviews)

            # print update about num of reviews inserted
            store_time = dt.datetime.now(tz=get_localzone())
            print(f'Successfully inserted {len(app_reviews)} {app_name} '
                  f'reviews into collection at {store_time.strftime(fmt)}.\n')

            # empty our list for next round of 100 batches
            app_reviews = []

        # Wait 1 to 5 seconds to start next batch
        time.sleep(random.randint(1,5))
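The code shown saves only at every 100th batch, so any reviews gathered since the last save are still sitting in app_reviews when the inner loop ends. One way to handle them is a final save after the loop. The helper below is our own sketch; flush_reviews is a hypothetical name, and the small stand-in class exists only so the example runs without a database.

```python
def flush_reviews(collection, app_reviews):
    """Insert any buffered reviews, empty the buffer, and return the count."""
    if not app_reviews:  # PyMongo's insert_many rejects an empty list
        return 0
    collection.insert_many(app_reviews)
    inserted = len(app_reviews)
    app_reviews.clear()
    return inserted


class InMemoryCollection:
    """Stand-in for a PyMongo collection so the sketch runs offline."""

    def __init__(self):
        self.docs = []

    def insert_many(self, documents):
        self.docs.extend(documents)


demo_collection = InMemoryCollection()
leftovers = [{'reviewId': 'x'}, {'reviewId': 'y'}]
print(flush_reviews(demo_collection, leftovers))  # prints 2
print(flush_reviews(demo_collection, leftovers))  # prints 0
```

In the scraping loop, this would be called as flush_reviews(review_collection, app_reviews) once the inner loop over batches has finished for an app.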

If you are looking to scrape Google Play Store Data, contact Web Screen Scraping today!!!
