Scrape a JavaScript-heavy website on a Raspberry Pi 3B+ using Python with Selenium

When I was trying to scrape a JavaScript-heavy website with my Raspberry Pi using Python, I ran into some interesting issues that needed to be solved.

I found that modules like requests, requests_html and urllib did not deliver the complete content for JavaScript websites that contain a shadow DOM (#shadow-root). When searching for a solution I found a few, but they relied on PhantomJS or other discontinued modules.

The solution I found was using ChromeDriver in headless mode. But the version I got my hands on kept throwing errors about the version of the browser.

After extensive searching I found the following solution:

1. Download the latest chromedriver from:

https://github.com/electron/electron/releases

(get the armv7 version)

2. Install this using the instructions I found on:

https://www.raspberrypi.org/forums/viewtopic.php?t=194176

  • cd /tmp
  • wget <url latest version armv7>
  • unzip <zip file>
  • sudo mv chromedriver /usr/local/bin
  • sudo chmod +x /usr/local/bin/chromedriver
  • sudo apt-get install libminizip1
  • sudo apt-get install libwebpmux2
  • sudo apt-get install libgtk-3-0

In your code, add these two arguments when you start the driver:
--headless
--disable-gpu

3. Update the Chromium browser

When trying to execute the script I still got the error about the Chromium version. I was able to solve that using:

  • sudo apt-get install -y chromium-browser
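Since the errors were about mismatched versions, it can help to print what both programs report before running the scraper. The following is only a minimal sketch: the helper names extract_version and reported_version are my own, and it assumes both chromedriver and chromium-browser are on the PATH and accept --version. It simply shows the two version numbers so they can be compared by eye.

```python
import re
import subprocess


def extract_version(text):
    """Return the first dotted version number found in text, or None."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    return match.group(0) if match else None


def reported_version(binary):
    """Run `<binary> --version` and return the version it prints, or None."""
    try:
        result = subprocess.run(
            [binary, "--version"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary is missing, not executable, or did not answer in time
        return None
    return extract_version(result.stdout)


if __name__ == "__main__":
    # Binary names as used in the steps above
    for binary in ("chromedriver", "chromium-browser"):
        print(binary, reported_version(binary))
```

If one of the two lines shows None, that program is not installed or not on the PATH.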

IT WORKS

Now the script finally worked.

The Python Script to get the page content

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

# Define the site to be opened
site = "http://…."

# Set Chrome options: run without a visible browser window
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")

# Open Chrome headless
driver = webdriver.Chrome(options=chrome_options)
# Give up if the page takes longer than 20 seconds to load
driver.set_page_load_timeout(20)
driver.get(site)
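One thing the script does not do is close the browser again. On a Raspberry Pi every leftover Chromium process costs memory, so it may be worth making sure driver.quit() always runs, even when the page load fails. Below is a minimal sketch of one way to do that. The helper name quitting is my own and not part of Selenium; it only relies on the driver having a quit() method.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Hand out the driver and call driver.quit() when the block ends.

    quit() runs even if the code inside the with-block raises,
    so no headless browser is left running in the background.
    """
    try:
        yield driver
    finally:
        driver.quit()


# Possible usage with the script above (names taken from that script):
#
# with quitting(webdriver.Chrome(options=chrome_options)) as driver:
#     driver.set_page_load_timeout(20)
#     driver.get(site)
#     ...  # analyze the page here, while the browser is still open
```

All analysis of the page has to happen inside the with-block, because the browser is gone once the block ends.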

4. Analyze the content of the page

With the page loaded in driver, it is possible to decompose its content further.

content1 = driver.find_element(By.TAG_NAME, '…..')
shadow_content1 = expand_shadow_element(content1)

To get access to the content of the shadow element, the function below is needed. It has to be defined before the two lines above are run:

# Function to expand a shadow element into usable content
def expand_shadow_element(element):
    # Ask the browser for the shadow root attached to the element
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
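Pages built from web components often nest one shadow root inside another. As a sketch of how the same execute_script idea could be extended, the helpers below look up a CSS selector inside a shadow root in the browser and return the matching element directly. The names query_shadow and query_nested_shadow are my own, and the sketch assumes the shadow roots are open (a closed one has no accessible shadowRoot).

```python
# JavaScript run in the browser: look up a CSS selector inside the
# shadow root of the host element passed in as arguments[0].
_QUERY_SHADOW_JS = (
    "const root = arguments[0].shadowRoot;"
    "return root ? root.querySelector(arguments[1]) : null;"
)


def query_shadow(driver, host, css_selector):
    """Return the first match for css_selector inside host's shadow root.

    Returns None when the host has no open shadow root or nothing matches.
    """
    return driver.execute_script(_QUERY_SHADOW_JS, host, css_selector)


def query_nested_shadow(driver, host, css_selectors):
    """Walk down through nested shadow roots, one selector per level."""
    element = host
    for selector in css_selectors:
        element = query_shadow(driver, element, selector)
        if element is None:
            # Stop early: this level has no shadow root or no match
            return None
    return element
```

With content1 from above, a call could look like query_nested_shadow(driver, content1, [...]), with one selector per shadow level of the page being scraped.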
