@eliasdabbas
Last active November 2, 2023 13:33
Get a summary of the currently running crawl jobs (using the advertools crawler)
import pandas as pd

from subprocess import run
from functools import partial

run = partial(run, text=True, capture_output=True)


def running_crawls():
    """Get details of currently running spiders.

    Get a DataFrame showing the following details:

    * pid: Process ID. Use this to identify (or stop) the spider that you want.
    * started: The time when this spider started.
    * elapsed: The elapsed time since the spider started.
    * %mem: The percentage of memory that this spider is consuming.
    * %cpu: The percentage of CPU that this spider is consuming.
    * args: The full command that was used to start this spider. Use this to
      identify the spider(s) that you want to know about.
    * output_file: The path to the output file for each running crawl job.
    * crawled_urls: The current number of lines in ``output_file``.
    """
    # List all processes with the columns we need; the first line of
    # stdout is the header row. Note that the header printed for the
    # ``args`` column may differ across platforms.
    ps = run(['ps', 'xo', 'pid,start,etime,%mem,%cpu,args'])
    ps_stdout = ps.stdout.splitlines()
    df = pd.DataFrame([line.split(maxsplit=5) for line in ps_stdout[1:]],
                      columns=ps_stdout[0].split())
    # Extract the output file path from the -o option of each crawl command.
    df['output_file'] = df['ARGS'].str.extract(r'-o (.*?\.jl)')[0]
    # Keep only the processes that are running crawl jobs.
    df_subset = df[df['ARGS'].str.contains('scrapy runspider')].reset_index(drop=True)
    if df_subset.empty:
        return pd.DataFrame()
    # Count the lines written so far to each crawl's output file.
    crawled_lines = run(['wc', '-l'] + df_subset['output_file'].str.cat(sep=' ').split())
    crawl_urls = [int(line.strip().split()[0])
                  for line in crawled_lines.stdout.splitlines()]
    # Drop the trailing "total" line that wc prints when given multiple files.
    crawl_urls = crawl_urls[:min(len(crawl_urls), len(df_subset))]
    df_subset['crawled_urls'] = crawl_urls
    df_subset.columns = df_subset.columns.str.lower()
    return df_subset
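
A minimal usage sketch, assuming advertools is installed and that a crawl was started with a ``.jl`` output file; the URL and file name below are hypothetical. Since ``advertools.crawl`` blocks until the crawl finishes, ``running_crawls()`` would be called from a separate session or notebook while the crawl is in progress:

```python
import advertools as adv

# Start a crawl; this runs "scrapy runspider ... -o example_crawl.jl"
# under the hood and blocks until the crawl is done.
adv.crawl('https://example.com', 'example_crawl.jl', follow_links=True)

# From another session/notebook while the crawl is running:
status_df = running_crawls()
print(status_df[['pid', 'elapsed', 'crawled_urls']])
```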
@eliasdabbas (Author)

Sample output DataFrame

[Screenshot of the sample output DataFrame, 2023-11-02]
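
As the docstring notes, the ``pid`` column can be used to stop a spider. A minimal sketch, assuming the first row of the output is the spider to stop; SIGTERM lets Scrapy shut down gracefully and close the output file:

```python
import os
import signal

status_df = running_crawls()
pid = int(status_df['pid'].iloc[0])  # pick the row of the spider to stop
os.kill(pid, signal.SIGTERM)
```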
