
How Web Scraping Is Used To Extract Amazon Prime Data Using Selenium And BeautifulSoup?


Selenium is a great tool for web scraping, but it has some flaws, which is understandable because it was designed primarily for testing web applications. BeautifulSoup, on the other hand, was created specifically for web scraping and is an excellent tool as well.

But even BeautifulSoup has its limits: on its own, it cannot reach data that sits behind a login wall, where user authentication or other user actions are required.

This is where Selenium comes in to automate user interactions with the website, while BeautifulSoup scrapes the data once we are past the wall.

When BeautifulSoup and Selenium are combined, you get a very capable web scraping toolkit. Selenium can extract data on its own, but BeautifulSoup is far better suited to parsing.
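
To see how the two fit together, here is a minimal sketch of the handoff (it assumes ChromeDriver is available on your PATH, and example.com merely stands in for a real site):

# Minimal sketch of the Selenium -> BeautifulSoup handoff.
# Assumes ChromeDriver is on PATH; example.com is only a placeholder.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://example.com')   # Selenium drives a real browser
html = driver.page_source           # grab the fully rendered HTML...
page = BeautifulSoup(html, 'lxml')  # ...and parse it with BeautifulSoup
print(page.title.text)
driver.quit()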

We will use BeautifulSoup and Selenium to scrape movie details from Amazon Prime Video, such as name, description, and rating, and then filter the movies based on their IMDb ratings.

Let’s discuss the process of scraping Amazon Prime data.

First, import the necessary modules:

from selenium import webdriver                    # browser automation
from selenium.webdriver.common.keys import Keys   # keyboard keys (e.g. ENTER)
from bs4 import BeautifulSoup as soup             # HTML parsing
from time import sleep                            # simple waits between actions
from selenium.common.exceptions import NoSuchElementException
import pandas as pd                               # tabular data and CSV export

Create three empty lists to store the movie information.

movie_names = []
movie_descriptions = []
movie_ratings = []

ChromeDriver must be installed for this program to work properly. Make sure you install the driver version that matches your installed version of Chrome.
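
If matching driver and browser versions by hand is tedious, the third-party webdriver-manager package can fetch a suitable driver for you (an optional alternative, not used in the rest of this article; install it with pip install webdriver-manager):

from selenium import webdriver
# Optional alternative: webdriver-manager downloads a ChromeDriver
# matching your installed Chrome (pip install webdriver-manager).
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())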

Now, define a function called open_site() that opens the sign-in page of Amazon Prime.

def open_site():
    # Launch Chrome with notifications disabled
    options = webdriver.ChromeOptions()
    options.add_argument("--disable-notifications")
    driver = webdriver.Chrome(executable_path='PATH/TO/YOUR/CHROME/DRIVER', options=options)
    # Open the Amazon Prime Video sign-in page
    driver.get(r'https://www.amazon.com/ap/signin?accountStatusPolicy=P1&clientContext=261-1149697-3210253&language=en_US&openid.assoc_handle=amzn_prime_video_desktop_us&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.primevideo.com%2Fauth%2Freturn%2Fref%3Dav_auth_ap%3F_encoding%3DUTF8%26location%3D%252Fref%253Ddv_auth_ret')
    sleep(5)
    # Fill in the credentials and submit with ENTER
    driver.find_element_by_id('ap_email').send_keys('ENTER YOUR EMAIL ID')
    driver.find_element_by_id('ap_password').send_keys('ENTER YOUR PASSWORD', Keys.ENTER)
    sleep(2)
    search(driver)
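
The fixed sleep() calls work, but they can be flaky on slow connections. As a drop-in refinement (not part of the original script), Selenium's WebDriverWait can wait explicitly for the same 'ap_email' field:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the email field to appear, then type into it
wait = WebDriverWait(driver, 10)
email_field = wait.until(EC.presence_of_element_located((By.ID, 'ap_email')))
email_field.send_keys('ENTER YOUR EMAIL ID')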

Let's create a search() function that looks for the genre specified.

def search(driver):
    # Search for the genre and submit with ENTER
    driver.find_element_by_id('pv-search-nav').send_keys('Comedy Movies', Keys.ENTER)

    # Scroll until the page height stops growing (infinite scroll)
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(5)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Hand the fully rendered page over to BeautifulSoup
    html = driver.page_source
    page_soup = soup(html, 'lxml')
    tiles = page_soup.find_all('div', attrs={"class": "av-hover-wrapper"})

    for tile in tiles:
        movie_name = tile.find('h1', attrs={"class": "_1l3nhs tst-hover-title"})
        movie_description = tile.find('p', attrs={"class": "_36qUej _1TesgD tst-hover-synopsis"})
        movie_rating = tile.find('span', attrs={"class": "dv-grid-beard-info"})
        try:
            rating = movie_rating.span.text
            # Keep only titles rated above 8.0 and below 10.0
            if 8.0 < float(rating[-3:]) < 10.0:
                movie_descriptions.append(movie_description.text)
                movie_ratings.append(rating)
                movie_names.append(movie_name.text)
                print(movie_name.text, rating)
        except (ValueError, AttributeError):
            pass

The function searches for the genre and scrolls to the bottom of the page with the JavaScript executor, because Amazon Prime Video loads results through infinite scrolling. It then calls the driver for page_source, which is fed to BeautifulSoup.

The if statement makes sure we keep only movies with a rating greater than 8.0 but less than 10.0.
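
To see why the slice works, suppose the scraped rating text is 'IMDb 8.4' (a hypothetical value; the actual text depends on Amazon's markup):

# Hypothetical rating string; rating[-3:] takes the last three characters
rating = "IMDb 8.4"
print(rating[-3:])           # '8.4'
print(float(rating[-3:]))    # 8.4 -> passes the 8.0 < x < 10.0 filter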

Let's make a pandas DataFrame to hold all our movie information and write it to a CSV file.

def dataFrame():
    # Gather the scraped lists into a single dictionary
    details = {
        'Movie Name' : movie_names,
        'Description' : movie_descriptions,
        'Rating' : movie_ratings
    }
    # Build the DataFrame and save it as a CSV file
    data = pd.DataFrame.from_dict(details, orient='index')
    data = data.transpose()
    data.to_csv('Comedy.csv')
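
Since all three lists grow in lockstep, the same frame can also be built directly (an equivalent, slightly simpler alternative, assuming the lists have equal lengths):

# Equivalent construction when the three lists are the same length
data = pd.DataFrame(details)
data.to_csv('Comedy.csv', index=False)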

Now let's call the functions we defined: open_site() logs in and runs the scrape, and dataFrame() saves the results.

open_site()
dataFrame()

Result


Your output will not look exactly the same as shown here, since the screenshot has extra formatting (column widths, text wrapping, and so on), but the content will be nearly identical.

Conclusion

Hence, BeautifulSoup and Selenium work well together and deliver excellent results for Amazon Prime Video, though Python offers other equally strong tools, such as Scrapy.

Looking for the best web scraping services to extract Amazon Prime data? Contact Web Screen Scraping today!

Request a quote!
