@kasparsd
Last active October 21, 2019 18:46
Finding WordPress in Alexa top 1 million sites, see http://crawler.wproll.com
#!/bin/bash
while IFS=',' read -r POS HOSTNAME; do
  # Skip hosts already recorded in checked.csv (the file may not exist on the first run)
  if grep -qxF "$HOSTNAME" checked.csv 2>/dev/null; then
    echo "Skipping $HOSTNAME, already checked."
    continue
  fi

  # Look for `/wp-content/` in the HTML output
  # ISWPORGCONTENT=$(curl -s -L -m 5 "$HOSTNAME" 2>&1 | tee "html/$HOSTNAME.txt" | grep "/wp-content/")

  # Check the login cookie, see http://wordpress.stackexchange.com/a/54442
  ISWPCOOKIE=$(curl -s -L -m 5 --head "$HOSTNAME/wp-login.php" 2>&1 | grep -F "=WP+Cookie+check;")

  # Look for readme.html, skipping because we check for the cookie already
  # ISWPORG=$(curl -s -L -m 5 "$HOSTNAME/readme.html" 2>&1 | grep "wordpress.org/support")

  # Look for WP.com/VIP sites, no need again because of the cookie check
  # ISWPCOM=$(curl -s -L -m 5 --head "$HOSTNAME" 2>&1 | grep "visit automattic.com/jobs")

  if [[ $ISWPCOOKIE ]]; then
    echo "$POS - $HOSTNAME is WP"
    echo "$POS,$HOSTNAME" >> topwp.csv
  else
    echo "$POS - $HOSTNAME is not WP"
  fi

  # Record the host so a later run can skip it
  echo "$HOSTNAME" >> checked.csv
done < top-1m.csv
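
The cookie check in the loop above can be pulled out into small functions, so the matching logic can be tried against canned headers without touching the network. This is only a sketch: the function names `has_wp_test_cookie` and `is_wordpress` and the sample headers are hypothetical and not part of the original gist; the curl flags and the `=WP+Cookie+check;` marker are the ones the script already uses.

```shell
#!/bin/bash

# Hypothetical helper: reads HTTP response headers on stdin and succeeds
# when they contain the WordPress test-cookie marker the script greps for.
has_wp_test_cookie() {
  grep -qF '=WP+Cookie+check;'
}

# Hypothetical wrapper: same curl flags as the script, result piped
# into the helper above. "$1" is the hostname to probe.
is_wordpress() {
  curl -s -L -m 5 --head "$1/wp-login.php" 2>&1 | has_wp_test_cookie
}

# Offline demonstration with made-up headers (an assumption, not captured output).
sample_headers=$'HTTP/1.1 200 OK\r\nSet-Cookie: wordpress_test_cookie=WP+Cookie+check; path=/\r\n'
if printf '%s' "$sample_headers" | has_wp_test_cookie; then
  echo "sample looks like WP"
fi
```

Inside the loop, `if is_wordpress "$HOSTNAME"; then ...` would then take the place of the `ISWPCOOKIE` variable and its `[[ ... ]]` test.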