How to Scrape Dynamic E-Commerce Product Pages in Python Using BeautifulSoup and Selenium

Installation

pip install selenium
pip install beautifulsoup4
pip install requests
pip install pandas
pip install lxml

Page Scraping

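import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://books.toscrape.com/catalogue/page-1.html'

# Open the page in Chrome so that any JavaScript runs before we read the HTML.
driver = webdriver.Chrome()
driver.implicitly_wait(30)  # Applies to element lookups made through the driver.
driver.get(url)

# Hand the rendered HTML to BeautifulSoup, then close the browser.
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()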

Scraping Elements

Missing Elements

[Return None if the inner element is missing; otherwise return the text of the inner element, for each x in all outer elements]
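The rule above can be sketched as a small helper. This is only an illustration under stated assumptions: the sample HTML and the `product` class are made up for the example, `text_or_none` is a hypothetical name, and the built-in `html.parser` is used here instead of `lxml` so the snippet runs without a browser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second product has no price element.
SAMPLE_HTML = """
<ul>
  <li class="product"><h3>Book A</h3><p class="price_color">£10.00</p></li>
  <li class="product"><h3>Book B</h3></li>
</ul>
"""


def text_or_none(outer, class_name):
    """Return the stripped text of the first inner element with class_name, or None."""
    inner = outer.find(attrs={'class': class_name})
    return None if inner is None else inner.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
# One entry per outer element, so the list stays aligned with the products.
prices = [text_or_none(x, 'price_color') for x in soup.find_all(attrs={'class': 'product'})]
print(prices)
```

Checking the result of `find` itself (rather than an attribute of it) is what keeps a missing element from raising an `AttributeError`.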

Putting it all together in a pandas DataFrame

# Each product card is an element with exactly this class string.
cards = soup.find_all(attrs={'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})
rating_class = re.compile(r'star-rating$')

# Product names come from the <h3> headings.
names = [x.string for x in soup.find_all('h3')]

# For each card, return None when the inner element is missing.
prices = [
    None if x.find(attrs={'class': 'price_color'}) is None
    else x.find(attrs={'class': 'price_color'}).string.replace('£', '')
    for x in cards
]
availability = [
    None if x.find(attrs={'class': 'instock availability'}) is None
    else x.find(attrs={'class': 'instock availability'}).text.strip()
    for x in cards
]
# The rating is the second class name, e.g. 'Three' in 'star-rating Three'.
ratings = [
    None if x.find(attrs={'class': rating_class}) is None
    else x.find(attrs={'class': rating_class}).get('class')[1]
    for x in cards
]

df = pd.DataFrame(
    list(zip(names, prices, availability, ratings)),
    columns=['product_name', 'price', 'availability', 'rating'],
)

The same logic, wrapped in a reusable function:

def scrape_page(url):
    # Load the page in a browser and parse the rendered HTML.
    driver = webdriver.Chrome()
    driver.implicitly_wait(30)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    driver.quit()

    cards = soup.find_all(attrs={'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})
    rating_class = re.compile(r'star-rating$')

    names = [x.string for x in soup.find_all('h3')]
    # For each card, return None when the inner element is missing.
    prices = [
        None if x.find(attrs={'class': 'price_color'}) is None
        else x.find(attrs={'class': 'price_color'}).string.replace('£', '')
        for x in cards
    ]
    availability = [
        None if x.find(attrs={'class': 'instock availability'}) is None
        else x.find(attrs={'class': 'instock availability'}).text.strip()
        for x in cards
    ]
    ratings = [
        None if x.find(attrs={'class': rating_class}) is None
        else x.find(attrs={'class': rating_class}).get('class')[1]
        for x in cards
    ]

    df = pd.DataFrame(
        list(zip(names, prices, availability, ratings)),
        columns=['product_name', 'price', 'availability', 'rating'],
    )
    return df


scrape_page('https://books.toscrape.com/catalogue/page-1.html')
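The `price` and `rating` columns come back as strings: a price with the pound sign stripped, and a rating word taken from the second `star-rating` class name. A minimal sketch of converting them to numbers follows. It assumes the rating words are One through Five, as in the books.toscrape.com class names; `RATING_WORDS`, `parse_price`, and `parse_rating` are hypothetical names that do not appear in the original code.

```python
# Assumed mapping from the rating class name to a number of stars.
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def parse_price(price):
    """Convert a price string such as '51.77' to a float; pass None through."""
    if price is None:
        return None
    # Tolerate surrounding whitespace and a leftover pound sign.
    return float(price.strip().lstrip('£'))


def parse_rating(rating):
    """Convert a rating word such as 'Three' to an int; unknown values become None."""
    return RATING_WORDS.get(rating)


print(parse_price('51.77'), parse_rating('Three'), parse_rating(None))
```

On a DataFrame returned by `scrape_page`, these helpers could be applied with `df['price'].map(parse_price)` and `df['rating'].map(parse_rating)`.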

Pagination

def scrape_multiple_pages(url, pages):
    # url: page URL with {} in place of the page number.
    # pages: number of pages to scrape.
    frames = []
    for page in range(1, pages + 1):  # Loop through each page number in the URL.
        page_url = url.format(page)
        # Only scrape the page if the URL returns a 200 OK response.
        if requests.get(page_url).status_code != 200:
            break
        frames.append(scrape_page(page_url))
        time.sleep(5)  # Wait 5 seconds between pages.
    if not frames:
        return pd.DataFrame(columns=['product_name', 'price', 'availability', 'rating'])
    # DataFrame.append was removed in pandas 2.0, so combine the pages with pd.concat.
    return pd.concat(frames, ignore_index=True)


scrape_multiple_pages('https://books.toscrape.com/catalogue/page-{}.html', pages=2)
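The `{}` placeholder in the URL is filled in by `str.format`, with page numbers starting at 1. The sketch below isolates just that step so it can be checked without a browser or network access; `page_urls` is a hypothetical helper name, not part of the original code.

```python
def page_urls(template, pages):
    """Build the URLs for pages 1..pages from a template containing {}."""
    return [template.format(page) for page in range(1, pages + 1)]


urls = page_urls('https://books.toscrape.com/catalogue/page-{}.html', pages=2)
for u in urls:
    print(u)
```

Printing the URLs first is a cheap way to confirm the template is correct before launching a browser for each page.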

Conclusion

3i Data Scraping

3i Data Scraping is an experienced web scraping service provider in the USA, offering a complete range of website data extraction and online outsourcing services.