Scraping dynamic HTML in Python with Selenium

When a web page is opened in a browser, the browser automatically executes JavaScript and generates dynamic HTML content. It is common to make HTTP requests to retrieve web pages; however, if a page is dynamically generated by JavaScript, an HTTP request only gets the source code of the page. Many websites implement Ajax to send information to and retrieve data from the server without reloading the page. To scrape Ajax-enabled web pages without losing any data, one solution is to execute the JavaScript using Python packages and scrape the fully loaded page. Selenium is a powerful tool to automate browsers and load web pages, with the ability to execute JavaScript.

1. Start Selenium with a WebDriver

Selenium does not include a web browser of its own. It calls an API on a WebDriver, which opens a browser. Both Firefox and Chrome have their own WebDrivers that interact with Selenium. If you do not need a browser UI, PhantomJS is an option that loads web pages and executes JavaScript in the background (though it is no longer maintained). In the following examples, I will use the Chrome WebDriver.

Before starting Selenium with a WebDriver, install Selenium (pip install selenium) and download the Chrome WebDriver.

Start Selenium with a WebDriver. By running the following code, a Chrome browser pops up.

from selenium import webdriver
driver = webdriver.Chrome('./chromedriver')  # specify the path to the chromedriver executable

2. Dynamic HTML

Let's take this web page as an example: https://www.u-optic.com/plano-convex-spherical-lens/en.html. The page makes Ajax requests to retrieve data and then generates its content dynamically. Suppose we are interested in the data listed in the HTML tables; they are not present in the original HTML source code, so a simple HTTP request will only retrieve the page source without the data.
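A quick check confirms this (a minimal sketch with the Requests library; the material code searched for is taken from the JSON response shown in section 3):

import requests

# requests fetches only the static source; no JavaScript is executed,
# so the Ajax-generated table rows are absent.
resp = requests.get('https://www.u-optic.com/plano-convex-spherical-lens/en.html')

# A value that only arrives via Ajax, such as a material code from the
# JSON data, does not appear in the static source.
print('4001010101' in resp.text)  # expected: False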

A closer look at the table generated by JavaScript in a browser (screenshot of the rendered table):

3. Start scraping

There are two ways to scrape dynamic HTML. The more obvious one is to load the page in the Selenium WebDriver: the WebDriver automatically executes the Ajax requests and then generates the full web page. After the page is completely loaded, we can use Selenium to acquire the page source, in which the data is present, as sketched below.
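
A minimal sketch of this approach (the fixed 5-second sleep is an arbitrary placeholder; a more robust explicit wait is shown under the additional notes at the end):

import time
from selenium import webdriver

driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.u-optic.com/plano-convex-spherical-lens/en.html')

# Crude wait for the Ajax requests to finish; an explicit wait on a
# specific element is preferable (see the additional notes below).
time.sleep(5)

html = driver.page_source  # now contains the dynamically generated rows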

However, on the example page the table shows only 10 records at a time due to pagination, so multiple Ajax requests have to be made to retrieve all the records.

Inspecting the web page under the Network tab of the browser's developer tools, we find two Ajax requests from which the page loads the data to construct the tables.


By copying and pasting the URLs into a browser, or by making HTTP requests with the Python Requests library, we retrieve 10 records in JSON:

{"draw":1,"recordsTotal":1564,"recordsFiltered":1564,"data":[{"id":66,"material_code":"4001010101","model":..."}]}

The returned JSON data indicates that there are 1564 records in total. A closer look at the Ajax URL reveals that the number of records to retrieve is specified by the "length" parameter in the URL.


There are 62 items in the first table and 1564 items in the second, so we change the value of the "length" parameter in each URL accordingly.

Requesting the data directly like this is much more convenient than parsing it out of the web page with XPath or CSS selectors, as sketched below.
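
A sketch of the direct approach (the query string is abridged here: the remaining parameters, omitted for brevity, should be copied verbatim from the request recorded in the Network tab):

import requests

# Endpoint observed in the Network tab; copy the other query
# parameters from the recorded request as-is.
url = 'https://www.u-optic.com/api/diy/get_product_by_type'
params = {'start': 0, 'length': 1564}  # ask for all records at once

data = requests.get(url, params=params).json()
print(data['recordsTotal'])  # 1564
records = data['data']       # one dict per table row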

4. Search for Ajax request urls in WebDriver logs

The Ajax request URLs are hidden inside the JavaScript code. We can search the WebDriver's performance log, which records network events, including Ajax requests. To retrieve performance logs from the WebDriver, we must enable them when creating the WebDriver object:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

caps = DesiredCapabilities.CHROME
caps['goog:loggingPrefs'] = {'performance': 'ALL'}
driver = webdriver.Chrome('./chromedriver', desired_capabilities=caps)
driver.get('https://www.u-optic.com/plano-convex-spherical-lens/en.html')
log = driver.get_log('performance')

The performance log records network activities that the WebDriver performed when loading the web page.

[{'level': 'INFO',
  'message': '{"message":{"method":"Network.responseReceivedExtraInfo","params":{...}}',
  'timestamp': 1596881833630},
 {'level': 'INFO',
  'message': '{"message":{"method":"Network.responseReceived","params":{...}}',
  ...},
 ...]

The value of the key "message" is a JSON string. Parsing the string with the Python json module, we find that the Ajax requests that retrieve the data are made under the method "Network.requestWillBeSent", and that the URL has the path "/api/diy/get_product_by_type".

{
	'method': 'Network.requestWillBeSent',
	'params': {
	     ....
		'request': {
			...
			'url': 'https://www.u-optic.com/api/diy/get_product_by_type?...start=0&length=10...'
		},
		...
	}
}

We use a regular expression to find these URLs:

import json
import re

pattern = r'https://www\.u-optic\.com/api/diy/get_product_by_type.+'

urls = list()  # a list to store the Ajax request URLs

for entry in log:
    message = json.loads(entry['message'])
    if message['message']['method'] == 'Network.requestWillBeSent':
        if re.search(pattern, message['message']['params']['request']['url']):
            urls.append(message['message']['params']['request']['url'])
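
With the request URLs recovered, the data can be fetched directly. A short sketch, assuming the urls list collected by the loop above:

import requests

# Replay each Ajax request found in the performance log. The 'length'
# parameter in each URL can be raised first (as described in section 3)
# to retrieve all records in one call.
all_records = []
for url in urls:
    payload = requests.get(url).json()
    all_records.extend(payload['data'])

print(len(all_records))
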
Additional notes:

When the WebDriver loads the web page, it may take a few seconds to make the Ajax requests and generate the page content. It is therefore recommended to configure the WebDriver to wait until the section we intend to scrape is completely loaded. In this example, the table data we want to scrape is placed under the class "text-bold", so we set the WebDriver to wait up to 5 seconds for an element of that class to become visible. If the element is not loaded within 5 seconds, a TimeoutException is thrown.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

wait_elementid = "//a[@class='text-bold']"  # XPath of the element to wait for
wait_time = 5  # seconds
WebDriverWait(driver, wait_time).until(
    EC.visibility_of_element_located((By.XPATH, wait_elementid)))

5. Conclusion

Dynamically generated web pages differ from their source code, so we cannot scrape them with plain HTTP requests. Executing JavaScript with Selenium is one solution for scraping such pages without losing any data. Furthermore, if the data to be scraped is retrieved via Ajax requests, we can search for the request URLs in the WebDriver's performance logs and retrieve the data directly with HTTP requests.
