@eliasdabbas
Created June 2, 2024 19:17
Filter non 200 status codes on a daily basis
import os
import datetime

import pandas as pd

# datetime.UTC requires Python 3.11 or later
today = datetime.datetime.now(datetime.UTC).strftime('%Y_%m_%d')

# Read every JSON Lines file in the folder into a single DataFrame
url_status_time = pd.concat(
    pd.read_json(f'/path/to/status_codes/{file}', lines=True)
    for file in os.listdir('/path/to/status_codes'))

# Keep only the non-200 responses and save them to a CSV named after today's date
(url_status_time
 [url_status_time['status'].ne(200)]
 [['url', 'status', 'crawl_time']]
 .to_csv(f'/path/to/non_200_codes/{today}.csv', index=False))
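
To see what the filter keeps, here is a minimal sketch that applies the same selection to a small in-memory sample. The sample rows, the extra `title` column, and the `filter_non_200` helper name are made up for illustration; only the `url`, `status`, and `crawl_time` column names come from the script.

```python
import io

import pandas as pd

# Hypothetical sample in JSON Lines format (one JSON object per line),
# standing in for the files under /path/to/status_codes.
sample_jsonl = io.StringIO(
    '{"url": "https://example.com/", "status": 200, '
    '"crawl_time": "2024-06-02 00:00:01", "title": "Home"}\n'
    '{"url": "https://example.com/missing", "status": 404, '
    '"crawl_time": "2024-06-02 00:00:02", "title": "Not found"}\n'
    '{"url": "https://example.com/old", "status": 301, '
    '"crawl_time": "2024-06-02 00:00:03", "title": null}\n'
)


def filter_non_200(df):
    """Keep rows whose status is not 200, with only the three report columns."""
    return df[df['status'].ne(200)][['url', 'status', 'crawl_time']]


sample = pd.read_json(sample_jsonl, lines=True)
non_200 = filter_non_200(sample)

# Print the CSV text that to_csv would otherwise write to disk
print(non_200.to_csv(index=False))
```

The row with status 200 is dropped, and so is the `title` column, because only the three listed columns are selected.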
@eliasdabbas
Author

Create a daily cron job that runs after the previous script, for example at 00:30 every day:

crontab -e

# add this line to the end of the file, and modify paths as needed: 

30 0 * * * /path/to/venv/bin/python /path/to/filter_non_200_status_codes.py
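
As a rough illustration of how that crontab line is read, the sketch below splits it into the five standard schedule fields (minute, hour, day of month, month, day of week) and the command. The `split_cron_line` helper and the `CRON_FIELDS` names are hypothetical and only meant for explanation; cron does its own parsing.

```python
# Names of the five schedule fields, in the order they appear in a crontab line
CRON_FIELDS = ('minute', 'hour', 'day_of_month', 'month', 'day_of_week')


def split_cron_line(line):
    """Split a crontab line into its five schedule fields and the command."""
    # maxsplit=5 keeps the whole command, spaces included, as the last part
    parts = line.split(maxsplit=5)
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    command = parts[5]
    return schedule, command


schedule, command = split_cron_line(
    '30 0 * * * /path/to/venv/bin/python /path/to/filter_non_200_status_codes.py')

print(schedule)
print(command)
```

Here `minute` is `30` and `hour` is `0`, while the three `*` fields mean every day of the month, every month, and every day of the week, so the job runs daily at 00:30.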

Synchronize the filtered files to your local machine with rsync:

rsync -avz YOURNAME@IP_ADDRESS:/path/to/non_200_codes/ /path/on/local/computer

This copies the files from /path/to/non_200_codes/ on the server to the folder of your choosing on your local machine.
