August 30, 2021

Scraping Images with Selenium and BeautifulSoup on Chrome

python

Introduction I: Scrape Elements by Tag Name and Class Name

We start off by importing three libraries:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

We browse the webpage with webdriver.Chrome, then use BeautifulSoup to extract data based on the HTML structure, such as tag names and class names.

We use OpenRice as an example data source:

url = "https://www.openrice.com/zh/hongkong/restaurants?what=%E5%8F%B0%E9%A2%A8"
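The what parameter in this URL is simply the percent-encoded UTF-8 form of the search keyword. As a small sketch using only the standard library (the helper name build_search_url is hypothetical and not part of the original code), the query can be decoded and rebuilt like this:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "https://www.openrice.com/zh/hongkong/restaurants?what=%E5%8F%B0%E9%A2%A8"

# Decode the query string to see which keyword is being searched.
keyword = parse_qs(urlsplit(url).query)["what"][0]
print(keyword)


def build_search_url(keyword):
    """Hypothetical helper: build the same kind of URL for any keyword."""
    base = "https://www.openrice.com/zh/hongkong/restaurants"
    return base + "?" + urlencode({"what": keyword})


# Rebuilding the URL from the decoded keyword gives back the original string.
print(build_search_url(keyword) == url)
```

Encoding the keyword this way keeps the URL valid for any search term, not only ASCII ones.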

We sleep for 10 seconds to make sure all data has been loaded before reading page_source:

browser = webdriver.Chrome("C:/Users/user/Repos/Python/2021-07-20-TrySelenium/chromedriver.exe")
browser.get(url)
time.sleep(10)
html = browser.page_source
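A fixed time.sleep(10) is either wasted time or too short on a slow connection. Selenium provides WebDriverWait for waiting on a condition instead; the idea behind it is plain polling, which the following browser-free sketch illustrates. The helper wait_until and the stand-in predicate page_is_ready are hypothetical names made up for this illustration. With a real browser, the predicate would inspect browser.page_source or look for a specific element.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for "the page has finished loading": becomes true on the third check.
checks = {"count": 0}


def page_is_ready():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(page_is_ready, timeout=5, interval=0.1)
print(f"ready after {checks['count']} checks")
```

The function returns as soon as the condition holds, so a fast page does not pay for the full timeout.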

Here html is a string containing the HTML source. We can now parse it using BeautifulSoup as follows:

soup = BeautifulSoup(html, "html.parser")

We now scrape all div elements that have the class name content-cell-wrapper:

print(len(soup.find_all("div", {"class": "content-cell-wrapper"})))

And we get 8. After studying the HTML structure further, we can extract the fields we need using regular expressions.
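As a minimal sketch of that last step: once the text of each cell has been pulled out (for instance with get_text() on each div), a regular expression can pick out the fields of interest. The sample strings and the pattern below are invented for illustration; the real OpenRice markup has to be inspected to decide what to match.

```python
import re

# Invented sample text, standing in for the text of each scraped cell.
cells = [
    "Typhoon Shelter Crab  |  Causeway Bay  |  $201-400",
    "Harbour Noodles  |  Mong Kok  |  $51-100",
]

# Capture the lower and upper bound of a price range such as "$201-400".
price_pattern = re.compile(r"\$(\d+)-(\d+)")

for text in cells:
    match = price_pattern.search(text)
    if match:
        low, high = (int(group) for group in match.groups())
        name = text.split("|")[0].strip()
        print(name, low, high)
```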

Introduction II: Automatic Image Scraping Through Google

Discussion and Pattern

We note that searching for name in Google Images is the same as sending a GET request to https://www.google.com/search?q=name&tbm=isch&hl=en&sa=X. We can therefore combine Selenium and BeautifulSoup to scrape all img elements from the result page.
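Because the search term ends up inside a URL, names that contain spaces or special characters should be percent-encoded first. A small sketch, assuming the query pattern quoted above (the function name image_search_url is hypothetical):

```python
from urllib.parse import quote_plus

QUERY_URL = "https://www.google.com/search?q={}&tbm=isch&hl=en&sa=X"


def image_search_url(name):
    """Fill the encoded search term into the Google Images query pattern."""
    return QUERY_URL.format(quote_plus(name))


print(image_search_url("blue jay"))
# An ampersand in the name must not be mistaken for a parameter separator.
print(image_search_url("cats & dogs"))
```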

We also note that each img uses its src attribute to determine the image source. However, not every src is of the form http://....jpg, .png, .gif, etc., which can be fetched with a direct GET request. The frontend may instead set src to a data: URI, in which the binary image data is embedded as a base64-encoded string, or to a blob: URL created by URL.createObjectURL(), which is only valid inside the page that created it. In neither case is src a direct link to an image file.
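These cases can be told apart by the prefix of src. The sketch below classifies a src value and decodes the base64 variant; the function names classify_src and decode_data_uri are hypothetical, and the example URI is built from known bytes rather than taken from a real page.

```python
import base64


def classify_src(src):
    """Return 'remote', 'inline', 'blob' or 'other' for an img src value."""
    if src.startswith(("http://", "https://")):
        return "remote"   # can be fetched with an ordinary GET request
    if src.startswith("data:"):
        return "inline"   # the image bytes are embedded in the attribute
    if src.startswith("blob:"):
        return "blob"     # only resolvable inside the page that created it
    return "other"


def decode_data_uri(src):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    head, payload = src.split(",", 1)   # "data:image/gif;base64", "R0lG..."
    mime_type = head[len("data:"):].split(";")[0]
    return mime_type, base64.b64decode(payload)


# Build a tiny example URI from known bytes instead of hard-coding base64.
raw = b"GIF89a"
uri = "data:image/gif;base64," + base64.b64encode(raw).decode("ascii")

print(classify_src(uri))
print(decode_data_uri(uri))
```

Only the remote and inline cases can be saved from the page source alone; a blob: URL would have to be read inside the browser session.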

For example, an image may require an access token stored in a cookie or in local storage. The frontend developer then cannot simply put the image link in src, because the browser would issue a GET request without the required header.
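To make the point concrete: a script can attach such a token itself, whereas a bare img tag cannot. The sketch below only constructs the request object with the standard library and does not send it. The URL and token are placeholders, and the Authorization: Bearer scheme is an assumption about how a site might protect its images.

```python
from urllib.request import Request

# Placeholder values for illustration only.
image_url = "https://example.com/protected/image.png"
token = "YOUR_ACCESS_TOKEN"

request = Request(image_url, headers={"Authorization": f"Bearer {token}"})

# The header travels with the request, which a plain <img> tag cannot arrange.
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```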

Code Implementation

We divide the steps into functions, from top to bottom. The method scrap accepts labels, which are used to

  • create the GET requests for image scraping; and
  • create a folder ./download_dir/{label} that stores all the scraped images.
# download ChromeDriver first according to your chrome version
# we need the path for .exe file.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import requests
import re
import base64
import os
import sys

class GoogleScrapper:
  def __init__(
    self,
    query_url="https://www.google.com/search?q={}&tbm=isch&hl=en&sa=X",
    download_dir="./scrapped_birds",
    chromedriver_path="C:/Users/jameslcc/Desktop/chromedriver.exe"
  ):
    self.query_url = query_url
    self.download_dir = download_dir
    self.chromedriver_path = chromedriver_path

  def scrap_img_els(self, label, wait_before_scrapping=0):
    url = self.query_url.format(label)
    browser = webdriver.Chrome(self.chromedriver_path)
    browser.get(url)
    time.sleep(wait_before_scrapping)
    html = browser.page_source
    browser.quit()  # end the session and stop the chromedriver process

    soup = BeautifulSoup(html, "html.parser")

    return soup.find_all("img")

  def download_imgs(self, img_els, label):
    class_dir = f"{self.download_dir}/{label}"

    # create the download folder and the label sub-folder if they are missing
    os.makedirs(class_dir, exist_ok=True)

    for i, img_el in enumerate(img_els):
      img_src = img_el.get('src', '')
      try:
        # if src refers to external link
        if img_src.startswith("http"):
          r = requests.get(img_src)
          if r.status_code == 200:
            content_type = r.headers.get('content-type', '')
            # image/png, image/jpeg, etc
            if content_type.startswith("image/"):
              # drop any parameters after ';' and keep only the subtype
              file_ext = content_type.split(';')[0].strip().replace("image/", "", 1)
              file_path = f"{class_dir}/{label}-{i}." + file_ext
              with open(file_path, 'wb') as f:
                f.write(r.content)
                print(f"{file_path} has been saved")

        # if src is an inline data URI (base64-encoded image data)
        elif img_src.startswith("data:"):
          # example: data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ ...
          head, data = img_src.split(',', 1)
          file_ext = head.split(';')[0].split('/')[1]
          plain_data = base64.b64decode(data)

          with open(f"{class_dir}/{label}-{i}." + file_ext, 'wb') as f:
            f.write(plain_data)
            print(f"{class_dir}/{label}-{i}." + file_ext+ " has been saved")

      except Exception as error:
        # report the failure and carry on with the next image
        print(error)

  def scrap(self, labels, wait_before_scrapping = 0):
    # cater for the case where labels is a single string; we want a list of labels
    if isinstance(labels, str):
      labels = [labels]

    for label in labels:
      img_els = self.scrap_img_els(label, wait_before_scrapping)
      self.download_imgs(img_els, label)

scrapper = GoogleScrapper()