@livibetter
Created March 9, 2012 18:44
Script for checking link-ins with Google Webmaster Tools data
*.checked
*.links
*.csv
wtal.re.sh

Webmaster Tools All Links Checker

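It processes CSV files downloaded from Google Webmaster Tools and lets you list the links that are new since your previous downloads. You can read my blog post about this script: http://blog.yjl.im/2012/03/checking-link-ins-with-google-webmaster.html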

Usage

  1. Download the CSV: in Webmaster Tools, go to your website's Dashboard » Traffic » Links to Your Site » Who links the most » More » Download latest links.
  2. Run ./wtal.sh *.csv.
  3. Run the command that wtal.sh prints to list the new links, e.g. grep <TIMESTAMP> "<DOMAIN>.links" | cut -d ' ' -f 2.
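
To make step 3 concrete, here is a minimal sketch of what a "<DOMAIN>.links" file holds and how the listing command reads it. The domain, URLs, and timestamp values are made up for illustration; the only thing taken from the script is the line layout, "<TIMESTAMP> <URL>". The sketch anchors the pattern with ^, which is slightly stricter than the command the script prints.

```shell
# Scratch directory; the file below stands in for a real "<DOMAIN>.links".
WORK_DIR="$(mktemp -d)"
LINKS_FILE="$WORK_DIR/example.com.links"

# Each line is "<TIMESTAMP> <URL>", the layout wtal.sh appends.
# All values here are hypothetical.
cat > "$LINKS_FILE" <<'EOF'
20120309 http://blog.example.net/old-post.html
20121209 http://forum.example.org/thread/42
20121209 http://news.example.net/story.html
EOF

# Keep only the lines recorded for one run, then drop the timestamp column.
NEW_LINKS="$(grep '^20121209 ' "$LINKS_FILE" | cut -d ' ' -f 2)"
echo "$NEW_LINKS"

rm -rf "$WORK_DIR"
```

Because every run tags its additions with its own timestamp, older runs stay queryable in the same file.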

Changelog

2012-12-09T19:27:11Z

The filename format has changed from

All_Links_<DOMAIN>_<TIMESTAMP>.csv

to

<DOMAIN>_<TIMESTAMP>_ExternalLinks_AllLinks.csv
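
As a rough illustration of how the newer filename splits into its parts, the sketch below uses the same parameter expansions as the script. The filename itself is hypothetical, and the real timestamp format may differ; the split only assumes the timestamp contains no underscore.

```shell
# Hypothetical filename in the newer format; the timestamp value is made up.
FILE_CSV="example.com_20121209T192711Z_ExternalLinks_AllLinks.csv"

# Strip the fixed suffix, leaving "<DOMAIN>_<TIMESTAMP>".
FILE_BASE="${FILE_CSV%_ExternalLinks_AllLinks.csv}"
# Remove the shortest "_*" suffix: what remains is the domain.
FILE_MAIN="${FILE_BASE%_*}"
# Remove the longest "*_" prefix: what remains is the timestamp.
FILE_TS="${FILE_BASE##*_}"

echo "domain:    $FILE_MAIN"
echo "timestamp: $FILE_TS"
```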

2012-03-09

First release.

#!/bin/bash
# Copyright (c) 2012 Yu-Jie Lin
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of
# this software and associated documentation files (the "Software"), to deal in
# the Software without restriction, including without limitation the rights to
# use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
# of the Software, and to permit persons to whom the Software is furnished to do
# so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
# Gist: https://gist.github.com/2007998
# Blog: http://blog.yjl.im/2012/03/checking-link-ins-with-google-webmaster.html
# Optional site-specific excludes: wtal.re.sh may define REGEX_EXCLUDES
[[ -f "wtal.re.sh" ]] && source "wtal.re.sh"

# Extended regexes matching links to ignore
REGEX_EXCLUDES=(
  "${REGEX_EXCLUDES[@]}"
  '^Links$'
  'https://bugs\.launchpad\.net/[^/]+/\+(bug|source)/.+'
  'https://gist\.github\.com/[^/]+/.+'
  'http://.*\.blogspot\.com/...._.._.._archive\.html'
  'http://([^.]*\.)?technorati\.com/'
  'https?://.*\.wordpress\.com/..../(../(../)?)?$'
  '/(archive|author|category|directory|feeds?|page|tag(ged|s?)?)/'
  # a leading ? must be escaped to match a literal question mark
  '\?(page|tag)='
  '\?(format|output|type)=(atom|rss)'
  '&view=print'
)
for FILE_CSV in "$@"; do
  FILE_BASE="${FILE_CSV%_ExternalLinks_AllLinks.csv}"
  # only the domain name
  FILE_MAIN="${FILE_BASE%_*}"
  # only the timestamp
  FILE_TS="${FILE_BASE##*_}"

  FILE_LINKS="${FILE_MAIN}.links"
  FILE_CHECKED="${FILE_MAIN}.checked"
  FILE_CSV_LINKS="${FILE_CSV}.links"

  touch "$FILE_LINKS" "$FILE_CHECKED"

  echo -n "$FILE_CSV... "

  # Don't process this CSV, it has been processed before
  grep -qxF "$FILE_TS" "$FILE_CHECKED" && echo already checked && continue

  # Work with both Sample Links and Latest Links:
  # drop the header line and a trailing ",YYYY-MM-DD" column if present
  CSV="$(sed '1d;s/,....-..-..$//' "$FILE_CSV")"
  for RE in "${REGEX_EXCLUDES[@]}"; do
    CSV="$(echo "$CSV" | grep -Ev "$RE")"
  done
  echo "$CSV" | sort > "$FILE_CSV_LINKS"

  # Lines that diff prefixes with ">" exist only in the new CSV;
  # replace that prefix with the timestamp
  NEW_LINKS="$(cut -d ' ' -f 2 "$FILE_LINKS" | sort | diff - "$FILE_CSV_LINKS" | grep '^>' | sed "s/^>/$FILE_TS/")"
  if [[ -z "$NEW_LINKS" ]]; then
    echo no new links
  else
    echo "$(echo "$NEW_LINKS" | tee -a "$FILE_LINKS" | wc -l) new links"
    echo
    echo "  grep $FILE_TS \"$FILE_LINKS\" | cut -d ' ' -f 2"
    echo
  fi

  echo "$FILE_TS" >> "$FILE_CHECKED"
done
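
The script sources an optional wtal.re.sh before building REGEX_EXCLUDES, so site-specific patterns can live in that file. The following is only a sketch of what such a file might contain and how its patterns would filter a list of URLs. The patterns and URLs are hypothetical, and the filter loop is a simplified copy of the one in the script, run through bash because the file uses an array.

```shell
# Work in a scratch directory so no real data is touched.
WORK_DIR="$(mktemp -d)"

# Hypothetical wtal.re.sh: one extended regex per array element.
cat > "$WORK_DIR/wtal.re.sh" <<'EOF'
REGEX_EXCLUDES=(
  '^https?://(www\.)?example\.org/'
  '/search\?q='
)
EOF

# Sample URLs, made up for illustration.
cat > "$WORK_DIR/urls.txt" <<'EOF'
http://example.org/mirror/post.html
http://blog.example.net/2012/03/review.html
http://example.net/search?q=wtal
EOF

# Source the file and drop every URL matching any pattern,
# mirroring the script's filter loop.
KEPT="$(cd "$WORK_DIR" && bash -c '
  source ./wtal.re.sh
  URLS="$(cat urls.txt)"
  for RE in "${REGEX_EXCLUDES[@]}"; do
    URLS="$(printf "%s\n" "$URLS" | grep -Ev "$RE")"
  done
  printf "%s\n" "$URLS"
')"

echo "$KEPT"
rm -rf "$WORK_DIR"
```

Keeping personal patterns in wtal.re.sh means the script itself can be updated without losing them.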