@miraculixx
Last active February 29, 2024 16:34
Python multiprocess parallel selenium web scraping with improved performance

How to run this

(output as of September 29, 2023)

$ python scraper.py
Does flying slower actually save fuel?
Is non-consented video recording admissable evidence in a civil trial in Maryland?
Iteration counts of AMG solver changes in parallel
Two switch flyback converter MOSFETs voltage stress
Inconsistency in index contraction
Airline forcibly changed return flight destination city over a month in advance. Are we eligible for compensation?
Blender Accessibility Features
Daisy chaining APs or connect them into the central router?
I (rev)?(pal)? the source code, you (rev)?(pal)? the input!
Does a company have to have your login information to verify your identity?
Is it ok to use std::ignore in order to discard a return value of a function to avoid any related compiler warnings?
verifying the Taylor expansion of ln(1+x) satisfies the properties of logarithm
What do we know about Andy Kaufman's SNL audition?
Science fiction story where a human is searching for immortality and meets an alien that has been searching for thousands of years
"bieten" with the meaning of "to ensure"
Funny Numbers :D
What is meant by software and hardware implementations of cryptograpic schemes? How to do it?
Does interspecies breastfeeding occur in the wild?
Where should I stop in this intersection when turning left?
Why does ranges::for_each return the function?
Do any power loads require both power lines disconnected by the "off" switch?
Probability Puzzle from a Quant Interview
Is there a resource for learning to read mathematical notation/equations/formulae?
My Medieval kingdom has birth control, why is the population so high?
beautifulsoup4
lxml
requests
selenium
urllib3
# answer to https://stackoverflow.com/q/53475578/890242
import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool, Pool
from bs4 import BeautifulSoup
from selenium import webdriver
import threading


def get_links(link):
    # collect absolute question URLs from the listing page, as plain strings
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # resolve relative hrefs against the page that was fetched (link), not a global
    titles = [str(urljoin(link, items.get("href"))) for items in soup.select(".question-hyperlink")]
    return titles


# per-thread storage, so each pool thread creates and reuses its own driver
threadLocal = threading.local()


def get_driver():
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        driver = webdriver.Chrome(options=chromeOptions)
        setattr(threadLocal, 'driver', driver)
    return driver


def get_title(url):
    # load the question page in this thread's browser and print its title
    driver = get_driver()
    driver.get(url)
    sauce = BeautifulSoup(driver.page_source, "lxml")
    item = sauce.select_one("h1 a").text
    print(item)


if __name__ == '__main__':
    url = "https://stackoverflow.com/questions/tagged/web-scraping"
    ThreadPool(5).map(get_title, get_links(url))
@himan94

himan94 commented Jun 9, 2020

Hey mate, I had a question regarding the code.

While using the map function to get the titles, the code calls the get_title function 50 times and I was wondering if it would open 50 browsers?

If yes, what would be the best practice for reducing the memory usage while maintaining the speed of parallel processing?

Thanks

@Sory-Noroc

Doesn't work if ThreadPool is changed to Pool tho... Getting pickle error

@miraculixx
Author

Hey mate, I had a question regarding the code.

While using the map function to get the titles, the code calls the get_title function 50 times and I was wondering if it would open 50 browsers?

If yes, what would be the best practice for reducing the memory usage while maintaining the speed of parallel processing?

Thanks

This starts a ThreadPool of 5 threads. Each thread creates its browser once and reuses it, so it has only one web browser open at any one time and the 50 calls share at most 5 browsers.
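
A minimal sketch of that reuse pattern, so it can be checked without launching Chrome. `FakeDriver`, `created`, `fetch` and `scrape` are hypothetical names introduced for illustration and are not part of the gist; `FakeDriver` merely stands in for `webdriver.Chrome`. The sketch also quits every driver once the work is done, which is one possible way to release browser memory.

```python
import threading
from multiprocessing.pool import ThreadPool


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False

    def get(self, url):
        return f"page for {url}"

    def quit(self):
        self.closed = True


thread_local = threading.local()
created = []                     # every driver that was created, kept for cleanup
created_lock = threading.Lock()  # protects the shared list across threads


def get_driver():
    # create the driver on first use in this thread, then reuse it
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = FakeDriver()
        thread_local.driver = driver
        with created_lock:
            created.append(driver)
    return driver


def fetch(url):
    return get_driver().get(url)


def scrape(urls, workers=5):
    with ThreadPool(workers) as pool:
        results = pool.map(fetch, urls)  # blocks until all URLs are processed
    # quit each browser once all work is done
    with created_lock:
        for driver in created:
            driver.quit()
    return results
```

Mapping 50 URLs through `scrape` creates at most `workers` drivers, because a driver is only created the first time a given pool thread asks for one.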

@miraculixx
Author

miraculixx commented Mar 19, 2021

Doesn't work if ThreadPool is changed to Pool tho... Getting pickle error

That would indicate that the get_links function returns a list of unpicklable objects. Should be easy to fix.

Update: fixed by using str(...) for returned values in get_links(), see updated code
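
A small sketch of why converting to `str` helps, assuming the failure comes from values that `pickle` cannot serialize. A process `Pool` pickles every argument it sends to a worker, while a `ThreadPool` does not. `Handle`, `is_picklable` and `plain_urls` are hypothetical names used only for illustration; `Handle` stands in for any object that carries unpicklable state.

```python
import pickle
import threading


class Handle:
    """Hypothetical object holding a lock, which pickle refuses to serialize."""

    def __init__(self, url):
        self.url = url
        self.lock = threading.Lock()


def is_picklable(obj):
    # Pool has to pickle each argument; try the same thing up front
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


def plain_urls(handles):
    # keep only plain strings, which always pickle cleanly
    return [str(h.url) for h in handles]
```

Checking `is_picklable` on the items before handing them to `Pool.map` makes this kind of error show up in the parent process, where it is easier to trace.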

@ngomezleal

ngomezleal commented Feb 11, 2022

@miraculixx hello friend, I hope you're doing well.
I'm running your code in PyCharm, but it only returns:

Process finished with exit code 0

How do I get it to return results?
I look forward to your answer.

@Tusenka

Tusenka commented Sep 29, 2023

I think it is better to use snake_case here ;)

@Tusenka

Tusenka commented Sep 29, 2023

chrome_options
instead of
chromeOptions

@miraculixx
Author

miraculixx commented Sep 29, 2023

@miraculixx hello friend, I hope you're doing well. I'm running your code in PyCharm, but it only returns:

Process finished with exit code 0

How do I get it to return results? I look forward to your answer.

Welcome to the wonderful world of web scraping ;-) Stack Overflow has changed its output slightly. I've updated the code along with an example output, as I ran it just now.
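
When a site changes its markup, a selector can silently match nothing, and the script then exits with code 0 and no output. One way to make that visible is a guard like the following sketch; `require_links` is a hypothetical helper, not part of the gist.

```python
def require_links(links, selector):
    """Return links unchanged, or fail loudly if the selector matched nothing."""
    if not links:
        raise RuntimeError(
            f"selector {selector!r} matched nothing; the page markup may have changed"
        )
    return links
```

Wrapping the result of the link-collecting step in such a check turns an empty run into an explicit error that names the selector to revisit.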

@Hypnos999

Hi, is it normal that when using a ThreadPool with multiple Selenium instances that stay open for a long time (even an hour), the webdrivers sometimes seem to freeze when I call the ThreadPool's .submit() or .map() and never run the script? They just sit there doing nothing, without even raising a timeout error.

From a bit of research, this could be a window "focus" error. I've also noticed that keeping my PC in a desktop view (where you can see all your open windows/apps; I don't know what it's called) makes it work flawlessly, but that isn't a real solution.

I hope you can help me understand this problem better. Have a nice day.
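
This does not diagnose the focus issue, but a hang can at least be made visible by putting a deadline on the batch. The sketch below uses `map_async(...).get(timeout=...)` from the standard library, which raises `multiprocessing.TimeoutError` when the results are not ready in time. `run_with_deadline` and its parameters are hypothetical names chosen for illustration. With real drivers, Selenium's `driver.set_page_load_timeout(seconds)` is a complementary setting for page loads that never finish.

```python
from multiprocessing import TimeoutError
from multiprocessing.pool import ThreadPool


def run_with_deadline(func, items, workers=5, timeout=60.0):
    """Map func over items; return the results, or None if the batch exceeds timeout."""
    with ThreadPool(workers) as pool:
        pending = pool.map_async(func, items)
        try:
            return pending.get(timeout=timeout)
        except TimeoutError:
            # worker threads cannot be interrupted from here; a hung
            # driver still has to be quit separately
            return None
```

Returning `None` is just one choice; logging the stuck URLs or recreating the drivers would fit the same spot.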
