Forem: Mohammed Mushahid Qureshi

Better web scraping using headless instance of Chrome

Mohammed Mushahid Qureshi — Mon, 29 Mar 2021 10:00:06 +0000

Have you ever written a web scraping program in Python using requests and noticed that the content differs from what you expected? You might not have gotten exactly the same results as when opening the website in a browser.

This is because some sites use JavaScript to render the content on the client side, or the site could be making API calls to the server and then rendering that content.

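Moreover, requests made using the requests library do not send the headers a real browser sends (by default the User-Agent identifies the client as python-requests), so they can register as an older or unknown browser and lead the server to respond with a page that is compatible with older browsers.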
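To see the difference in headers for yourself, here is a small sketch that prints what requests would send without actually touching the network. The helper name prepared_headers and the values in BROWSER_HEADERS are made up for illustration; a real browser sends more headers than these two.

```python
import requests

# Hypothetical browser-like headers; the exact values are only illustrative.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def prepared_headers(url, extra_headers=None):
    """Return the headers requests would send for a GET, without sending it."""
    session = requests.Session()
    request = requests.Request("GET", url, headers=extra_headers)
    # prepare_request() merges the session defaults with our own headers
    return dict(session.prepare_request(request).headers)


default = prepared_headers("https://example.com")
browser_like = prepared_headers("https://example.com", BROWSER_HEADERS)

print(default["User-Agent"])       # starts with "python-requests/"
print(browser_like["User-Agent"])  # the browser-like value from above
```

Passing the same dictionary as headers=BROWSER_HEADERS to requests.get() makes the request look more like a browser, but it still will not run any JavaScript on the page.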

This problem can be solved by using webdrivers. The Chrome webdriver can be downloaded from https://chromedriver.chromium.org/ . Make sure to download the driver that matches your Chrome version, and either put chromedriver.exe in the folder you run your Python program from or add it to PATH.

Usually with requests the code for web scraping looks something like this:

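from bs4 import BeautifulSoup
import requests

# Placeholder: replace with the page you want to scrape
url = "https://example.com"

res = requests.get(url)
res_soup = BeautifulSoup(res.text, 'html.parser')
print(res_soup.prettify())

# Print every img tag found in the response
for image in res_soup.findAll('img'):
    print(image)

# Print only the source of each image (None if the tag has no src attribute)
for image in res_soup.findAll('img'):
    imageSource = image.get('src')
    print(imageSource)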

In the above code we make a GET request to the URL stored in url, then use BeautifulSoup to parse the response text as HTML and store the result in res_soup. We can then look for tags such as img in this response using the findAll() method, which returns all the tags matching the given filter (here, img tags).
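If you want to try the parsing step without making a request, here is a minimal sketch on a made-up HTML string (SAMPLE_HTML is not from any real site). It uses find_all(), which is the current name of the same method; findAll() is the older alias.

```python
from bs4 import BeautifulSoup

# Made-up page: one normal image, one image that only has data-src
SAMPLE_HTML = """
<html><body>
  <img src="/static/logo.png" alt="logo">
  <img data-src="/static/lazy.png" alt="lazy-loaded">
  <p>No image here</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() returns every tag matching the filter, here all img tags
img_tags = soup.find_all("img")

# .get() returns None instead of raising KeyError when src is missing
sources = [tag.get("src") for tag in img_tags]

print(len(img_tags), sources)
```

The second image has no src at all, which is exactly the kind of tag that breaks image['src'] with a KeyError.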

Now using the selenium chrome webdriver the code looks like this:

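from bs4 import BeautifulSoup
from selenium import webdriver

# Placeholder: replace with the page you want to scrape
url = "https://example.com"

driver = webdriver.Chrome()
driver.get(url)

# Ask the browser for the HTML after JavaScript has run
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

sel_soup = BeautifulSoup(html, 'html.parser')
print(len(sel_soup.findAll('img')))
images = []
for image in sel_soup.findAll('img'):
    #print(image)
    # Fall back to data-src when the tag has no src attribute
    src = image.get("src") or image.get("data-src")
    if src:
        images.append(src)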

First we tell Selenium to use the Chrome webdriver and launch it with driver = webdriver.Chrome(). Then we use the .get() method of the driver to open a website. Next, we extract the HTML content of the page by executing some JavaScript in the browser with the driver's .execute_script() method. Then we use BeautifulSoup to parse this text as HTML and use the findAll method to find all the image tags. The notable difference here is that some websites that render content on the client side leave out the src attribute of an img tag and use the data-src attribute instead, and either attribute may hold a data URI containing the base64-encoded image rather than a regular URL.
Note: Selenium also has methods to obtain the HTML content of individual tags directly from the driver object.
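When an image source turns out to be a data URI, it can be turned back into image bytes with the standard library. This is only a sketch: decode_data_uri is a hypothetical helper, and it handles just the common data:<type>;base64,<payload> form.

```python
import base64


def decode_data_uri(uri):
    """Return (media_type, raw_bytes) for a base64 data URI, else None."""
    if not uri.startswith("data:"):
        return None  # a regular URL, nothing to decode
    header, _, payload = uri[len("data:"):].partition(",")
    if not header.endswith(";base64"):
        return None  # not base64 encoded, ignored in this sketch
    media_type = header[: -len(";base64")] or "text/plain"
    return media_type, base64.b64decode(payload)


# Build a tiny example URI from known bytes so the result can be checked
sample = "data:image/gif;base64," + base64.b64encode(b"GIF89a").decode("ascii")

print(decode_data_uri(sample))
print(decode_data_uri("https://example.com/a.png"))
```

The returned bytes can be written to a file opened in 'wb' mode to save the image.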

We can also use these webdrivers in headless instances so that the window doesn't appear i.e. the browser is not presented to the user. This can be easily performed by adding the few lines mentioned below to our code:

from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1200")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")   
chrome_options.add_argument("--ignore-certificate-errors")

First we import the Options class, which is used to pass arguments and options to our webdriver. We use --disable-gpu since we are not going to display anything in the browser and it may save some system resources. The --headless argument is the most essential one here, as it tells the driver to launch the browser in headless mode. The --window-size argument is specified since the default window size in headless mode is 800x600, which can cause issues on some websites. The --disable-extensions argument is provided to stop extensions from interfering in some cases. The --no-sandbox and --disable-dev-shm-usage arguments are needed in some cases to help reduce Chrome webdriver crashes in headless mode. Finally, --ignore-certificate-errors allows Chrome to ignore errors due to SSL certificates.

We also need to change

driver = webdriver.Chrome()

to

driver = webdriver.Chrome(options=chrome_options)

to tell the driver to use our Options.

Docs for reference
Selenium Python Docs
BeautifulSoup Docs

My web scraper for google images can be found here:

mushahidq / py_webScraper

A simple web scraper using beautifulsoup and requests

File Descriptions:

simmpleWebScraper.py : It is a simple web scraper built using requests and beautifulsoup to get data from any website.

googleImages.py : Contains a Google Images web scraper that obtains images from Google using ChromeWebDriver, Selenium and BeautifulSoup.

googleImagesWithRequests.py : This web scraper for Google Images uses the requests library and BeautifulSoup.

sampleGoogleImages.html : This is the page obtained when using the requests library.

As can be seen, more images are obtained with the WebDriver because it runs the page's JavaScript, whereas with the requests library we only get the plain HTML and CSS.

Also check out my previous post:

Identifying Colours in Images using Python and OpenCV

Mohammed Mushahid Qureshi — Thu, 25 Feb 2021 13:33:49 +0000

Hi everyone,

I'll try to explain in this post how we can use an unsupervised ML model to find the most used colours in an image.

KMeans Clustering

KMeans clustering is a clustering algorithm, which means it is used to divide the given data into subgroups such that the data within one subgroup or cluster is similar, and different from the data in another subgroup or cluster. Clustering is one of the methods used in unsupervised machine learning. This means that the performance of the algorithm is not evaluated by comparing its output to true labels of the data; instead, the goal is to investigate the structure of the data by grouping it into clusters or subgroups.

One of the applications of KMeans clustering is that it can be used to group together colours in an image to find the most used colours in a given image.
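Before moving on to images, here is a toy sketch of what clustering does, using six made-up 2D points that form two obvious groups. The points, n_init and random_state values are my own choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up points: three near (1, 1) and three near (8, 8)
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.1, 7.9], [7.8, 8.2],
])

# n_init and random_state are fixed so the run is repeatable
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(points).tolist()

print(labels)                  # one cluster label per point
print(model.cluster_centers_)  # one centre per cluster
```

No labels were given to the model, yet the first three points end up in one cluster and the last three in the other. With image pixels, the points are RGB values and the cluster centres are the colours we are after.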

Building a model in a Jupyter Notebook or Google Colaboratory

Imports

We start by importing the libraries we are going to use. We'll be using matplotlib.pyplot to generate the pie chart, OpenCV to read the image, the KMeans algorithm from the sklearn.cluster package, rgb2lab to convert image colours to the Lab colour space and deltaE_cie76 to compare them. We will also use the os module to combine paths when reading files and the Counter from the collections library to extract the count.
The imports should look like this:

import os
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
import cv2
from collections import Counter
from skimage.color import rgb2lab, deltaE_cie76

Since we are working in Jupyter notebook we also need to add %matplotlib inline to tell matplotlib to display the plots inside the notebook.

Using OpenCV to read the image

Images can be read using the method cv2.imread() by passing the complete path as an argument. We can then use pyplot to plot this image.

image = cv2.imread(os.path.join('path_to_image', 'image.jpg'))
plt.imshow(image)

At this point you'll notice that there is something wrong with the colours of the plotted image. This is because OpenCV reads the image in Blue Green Red (BGR) order, and to view the actual image we need to convert it to Red Green Blue (RGB) using the method cv2.cvtColor().

image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.imshow(image)
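To make the channel order concrete, here is a small sketch with a made-up 1x2 image and plain NumPy. For a 3-channel image, reversing the last axis swaps the first and third channels, which is the same reordering as the BGR to RGB conversion.

```python
import numpy as np

# A 1x2 image in BGR order: one pure-blue pixel and one pure-red pixel
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the first and third channels (BGR -> RGB)
rgb = bgr[..., ::-1]

print(rgb[0, 0])  # the blue pixel, now written as R=0, G=0, B=255
print(rgb[0, 1])  # the red pixel, now written as R=255, G=0, B=0
```

The slice is a view on the original array rather than a copy, so call .copy() on it if you need an independent array.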

Colour Identification

We first need a function to convert RGB values to hex colour codes so we can use these as labels in our pie chart. We use string formatting for this, which displays our integers as two-digit hexadecimal numbers. We could also use the method binascii.hexlify().

def RGB2HEX(color):
    return "#{:02x}{:02x}{:02x}".format(int(color[0]), int(color[1]), int(color[2]))
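Here is a quick check of the function on a few values, along with a sketch of the binascii.hexlify() alternative. The name rgb_to_hex_binascii is made up for this example. Note that int() truncates, so a float channel such as 254.9 becomes 254.

```python
import binascii


def RGB2HEX(color):
    return "#{:02x}{:02x}{:02x}".format(int(color[0]), int(color[1]), int(color[2]))


def rgb_to_hex_binascii(color):
    # Alternative using binascii.hexlify(); every channel must fit in 0..255
    return "#" + binascii.hexlify(bytes(int(c) for c in color)).decode("ascii")


print(RGB2HEX((255, 165, 0)))              # -> #ffa500
print(RGB2HEX((254.9, 0.2, 15.5)))         # -> #fe000f (floats are truncated)
print(rgb_to_hex_binascii((255, 165, 0)))  # -> #ffa500
```

Since cluster centres are floats, rounding them before conversion gives slightly more faithful colour codes than truncating.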

Getting colours from an image

First we'll reduce the image size to reduce the execution time of our program. We also need to convert the shape of the array containing the image into something we can pass to our clustering model, so we reshape the array to 2 dimensions: one row per pixel and one column per colour channel.

mod_img = cv2.resize(image, (600, 400), interpolation = cv2.INTER_AREA)
mod_img = mod_img.reshape(mod_img.shape[0]*mod_img.shape[1], 3)
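The shapes involved are easier to follow with a stand-in array. In this sketch a zero-filled NumPy array plays the role of the resized image; cv2.resize takes the size as (width, height), so (600, 400) gives an array with 400 rows and 600 columns.

```python
import numpy as np

# Stand-in for the resized image: 400 rows, 600 columns, 3 colour channels
mod_img = np.zeros((400, 600, 3), dtype=np.uint8)
mod_img[0, 1] = (10, 20, 30)  # mark the second pixel of the first row

# -1 lets NumPy work out the number of rows (400 * 600 = 240000)
pixels = mod_img.reshape(-1, 3)

print(mod_img.shape, "->", pixels.shape)
print(pixels[1])  # the marked pixel is now row 1
```

Each row of pixels is one pixel's three channel values, which is the form KMeans expects: one sample per row, one feature per column.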

Now we implement KMeans

number_of_colours = 8
clf = KMeans(n_clusters = number_of_colours)
# Cluster the reshaped pixel array; labels holds one cluster index per pixel
labels = clf.fit_predict(mod_img)

The KMeans algorithm creates clusters based on the supplied number of clusters, which in our case is the number of top colours we want. We fit and predict on the same image and store the predictions in the variable labels.

Counting the colours and plotting the Pie chart

We use Counter to get the count of all labels, i.e. how many times each value is present in labels. To find the colours, we use clf.cluster_centers_, where the centroids of all clusters are stored. We iterate over the keys in counts to build ordered_colours, which pairs each label in counts with the colour of its cluster (here we use that to group similar colours to provide a better result), so we now know how many times each colour is present in the image. Finally we convert the values to hex codes and store them in hex_colours.

We plot a pie chart with the values from counts and the labels from hex_colours. We also get the colours for the pie chart from hex_colours.

counts = Counter(labels)
center_colours = clf.cluster_centers_
# Colour of each cluster, in the same order as the keys of counts
ordered_colours = [center_colours[i] for i in counts.keys()]
hex_colours = [RGB2HEX(colour) for colour in ordered_colours]
plt.figure(figsize = (8, 6))
plt.pie(list(counts.values()), labels = hex_colours, colors = hex_colours)
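The pairing of labels and cluster centres can be checked by hand on made-up data. In this sketch the labels and centres are invented rather than produced by KMeans, and as a small variation the colours are ranked from most to least frequent with most_common().

```python
from collections import Counter

import numpy as np

# Made-up KMeans output: one label per pixel and one RGB centre per cluster
labels = np.array([2, 0, 2, 1, 2, 0])
center_colours = np.array([
    [255.0, 0.0, 0.0],  # cluster 0: red
    [0.0, 255.0, 0.0],  # cluster 1: green
    [0.0, 0.0, 255.0],  # cluster 2: blue
])

counts = Counter(labels.tolist())

# most_common() sorts the clusters from most to least frequent
ranked = counts.most_common()

# Look up each cluster's colour by its label, then convert to a hex code
ordered_colours = [center_colours[label] for label, _ in ranked]
hex_colours = [
    "#{:02x}{:02x}{:02x}".format(*(int(c) for c in colour))
    for colour in ordered_colours
]
shares = [count / len(labels) for _, count in ranked]

print(list(zip(hex_colours, shares)))
```

Blue comes first because label 2 appears three times out of six. The important part is that each colour is looked up by its own label, so the counts and the colours stay in the same order.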

Bringing everything together into a function

This function accepts three arguments: the path to the image, the number of colours to be identified and a Boolean show_chart to display the pie chart.

def get_img(img_path):
    # Helper not shown in this post (assumed): read the image and convert BGR to RGB
    img = cv2.imread(img_path)
    return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

def get_colours(img_path, no_of_colours, show_chart):
    img = get_img(img_path)
    mod_img = cv2.resize(img, (600, 400), interpolation = cv2.INTER_AREA)
    mod_img = mod_img.reshape(mod_img.shape[0]*mod_img.shape[1], 3)

    #Define the clusters
    clf = KMeans(n_clusters = no_of_colours)
    labels = clf.fit_predict(mod_img)

    counts = Counter(labels)
    # Sort by label so that position i in the lists below matches cluster i
    counts = dict(sorted(counts.items()))

    center_colours = clf.cluster_centers_
    ordered_colours = [center_colours[i] for i in counts.keys()]
    hex_colours = [RGB2HEX(ordered_colours[i]) for i in counts.keys()]
    rgb_colours = [ordered_colours[i] for i in counts.keys()]

    if show_chart:
        # Only draw the chart; nothing is returned in this case
        plt.figure(figsize = (8, 6))
        plt.pie(list(counts.values()), labels = hex_colours, colors = hex_colours)
        return
    else:
        return rgb_colours
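To check the whole pipeline without an image file, here is a sketch that runs the same steps on a synthetic image. The function dominant_colours is a hypothetical variant that takes an RGB array instead of a path, skips the resize and the chart, and rounds the cluster centres before converting them to hex codes.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans


def dominant_colours(rgb_image, n_colours):
    """Map each dominant colour (as a hex code) to its pixel count."""
    pixels = rgb_image.reshape(-1, 3)
    # n_init and random_state are fixed so the run is repeatable
    model = KMeans(n_clusters=n_colours, n_init=10, random_state=0)
    labels = model.fit_predict(pixels)
    counts = Counter(labels.tolist())

    result = {}
    for label, count in counts.items():
        # Round the float centre to the nearest integer channel values
        r, g, b = (int(v) for v in np.rint(model.cluster_centers_[label]))
        result["#{:02x}{:02x}{:02x}".format(r, g, b)] = count
    return result


# Synthetic 20x30 image: left half red, right half blue
image = np.zeros((20, 30, 3), dtype=np.uint8)
image[:, :15] = (255, 0, 0)
image[:, 15:] = (0, 0, 255)

print(dominant_colours(image, 2))
```

With only two colours in the image, each half (300 pixels) ends up in its own cluster. On a real photo the centres are averages of many similar pixels, so the hex codes are approximations of the dominant colours.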

You can find the google colab notebook here: https://github.com/mushahidq/py_colour_identifier/blob/main/colour_identifier.ipynb

I'll soon be making another post on how to turn this into a web app and deploy it to Heroku.

This is my first post so some feedback would be much appreciated.