@FiXato, last active February 17, 2019

MOVED TO PLEXODUS-TOOLS

This project has been merged into Plexodus-Tools

CLI tool to archive Google+ Comments frame for Blogger blogs

Requirements

The scripts in this repository rely heavily on a variety of CLI utilities and services: they use jq, (g)sed, (g)grep, curl, and various Google APIs to store the Google+ posts and comments locally for Blogger blogs that have Google+ comments enabled.

API access

Blogger API V3

I use Blogger's official API to retrieve the Blog#id based on the Blog's URL, as well as to retrieve all blog post URLs based on the Blog#id.
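
For reference, that lookup boils down to a single call to the Blogger v3 blogs/byurl endpoint, the same URL getblogid.sh constructs further below (your.blogger.blog.example is a placeholder):

curl -s "https://www.googleapis.com/blogger/v3/blogs/byurl?url=https://your.blogger.blog.example&key=${BLOGGER_APIKEY}" | jq -r '.id'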

Google+ API

I use the official Google Plus API to retrieve the top level posts (Activities, as they are called in the API) used in the Google+ Comments for Blogger widget, as well as their Comments resources.
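
Both APIs expect an API key, which the scripts read from environment variables (these are the variable names _functions.sh checks; the key values are placeholders):

export BLOGGER_APIKEY="AIza...your-blogger-api-v3-key"
export GPLUS_APIKEY="AIza...your-google-plus-api-key"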

(The key values above are obviously just examples; you'll need to replace them with your own actual keys.)
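
For reference, the two lookups boil down to the Activities.get and Comments.list endpoints, with URLs as constructed in get_comments_from_google_plus_api_by_activity_id.sh below ($activity_id being a placeholder):

curl "https://www.googleapis.com/plus/v1/activities/$activity_id?key=$GPLUS_APIKEY"
curl "https://www.googleapis.com/plus/v1/activities/$activity_id/comments?key=$GPLUS_APIKEY&maxResults=500&sortOrder=ascending"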

Google+ web integration API

I use the G+ Comments Widget from Google+'s Web Integrations API to get a list of Activity#id's associated with the Blogger blog post, based on its URL.
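
In practice this is a single GET request with the blog post URL appended as the query parameter, using the same widget-render endpoint as store_comments_frame.sh below:

curl "https://apis.google.com/u/0/_/widget/render/comments?first_party_property=BLOGGER&query=https://your.blogger.blog.example/2018/01/blog-title.html"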

grep

grep is a CLI tool for regular-expression-based filtering of files and data. Since the BSD version of grep that comes with macOS is rather limited, please install GNU grep instead, via brew install grep.

Once installed on macOS, you should have access to GNU's version of grep via the g-prefix: ggrep. The scripts in this repository will automatically use ggrep rather than grep if they find it available on your system.

sed

sed, the Stream EDitor, is used for regular expression substitution. Since the BSD version of sed that comes with macOS is rather limited, you will likely need to install GNU sed instead, via brew install gnu-sed.

Once installed on macOS, you should have access to GNU's version of sed via the g-prefix: gsed. The scripts in this repository will automatically use gsed rather than sed if they find it available on your system.

curl

curl is a CLI tool for retrieving online resources. For the tools in this project, I use it to send HTTP GET requests to the APIs.

jq

jq is an excellent CLI tool for parsing and filtering JSON: https://stedolan.github.io/jq/

It can be downloaded from its website, or through your package manager (for instance on macOS through brew install jq).
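
For example, the HTML-conversion script below pulls individual fields out of a cached Activity resource like this ($activity_id being a placeholder):

title=$(jq -r '.title' "data/gplus/activities/$activity_id.json")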

Usage:

Get a complete archive:

While you can manually pipe results into the fairly self-contained CLI scripts, it's easiest to just use the export_blog.sh script which does everything for you:

DEBUG=1 REQUEST_THROTTLE=1 PER_PAGE=500 ./export_blog.sh https://your.blogger.blog.example
  • DEBUG=1 enables debug messages on stderr
  • REQUEST_THROTTLE=1 will sleep for a single second in some cases, to throttle the API requests. It could probably be reduced to 0.5, or even to 0 to disable throttling.
  • PER_PAGE=500 will request 500 blog posts per API request. Not yet implemented for Comments, though I figured it was easiest to hardcode that to 500, since that's the maximum for an Activity anyway; it saves on pagination.

The script tries to cache queries for a day to reduce needless re-querying of the APIs.

Get Blog ID for Blog URL

./getblogid.sh https://your.blogger.blog.example
#=> 12345

This will also store the $blog_id in data/blog_ids/${domain}.txt.

Get URLs for all blog posts for Blogger blog with given id

./getposturls.sh 12345

This will output a newline-separated list of Blogger blog post URLs:

https://your.blogger.blog.example/2018/01/blog-title.html
https://your.blogger.blog.example/2019/01/second-blog-title.html

This will also store the Blogger.posts JSON responses in data/blog_post_urls/${blog_id}-${per_page}(-${page_token})-${year}-${month}-${day}.json and the list of post URLs in data/blog_post_urls/${blog_id}-${per_page}-${year}-${month}-${day}.txt.

Store Google+ Comments widget for blog post with given URL(s)

echo -e "https://your.blogger.blog.example/2018/01/blog-title.html\nhttps://your.blogger.blog.example/2018/01/another-blog-title.html" | ./store_comments_frame.sh

Google+ Comments Widget responses will be stored in ./data/comments_frames/your.blogger.blog.example/, with almost all special characters in the filename replaced by dashes (e.g. 2018-01-blog-title.html).

List Google+ ActivityIDs from Google+ Comments widget files/dumps

This tries to list all Activity IDs it can find in the Google+ Comments widget result.

echo -e "data/comments_frames/your.blogger.blog.example/2018-01-blog-title.html\ndata/comments_frames/your.blogger.blog.example/2018-01-another-blog-title.html" | ./get_activity_ids_from_comments_frame.sh

This will return a list of newline-separated Google+ Activity#id results you can use to look up Google+ Posts (Activities) through the Google Plus API (for as long as it's still available).

Get the Activity and its Comments for a given activity id (and convert to HTML)

This will look up the Google+ Activity JSON resource through the Google+ API's Activity.get endpoint, as well as its associated Google+ Comments JSON resources through the Google+ API's Comments.list endpoint.

echo -e "asdDSAmKEKAWkmcda3o01DMoame3" | ./get_comments_from_google_plus_api_by_activity_id.sh

For now it also tries to do conversion from JSON to very limited and ugly HTML, but that will likely be split off to its own script.

The JSON resources are stored at:

  • data/gplus/activities/$activity_id.json
  • data/gplus/activities/$activity_id/comments.json

The HTML output for now is stored at:

  • data/output/$domain/html/sanitised-blog-url-path.html
  • data/output/$domain/html/all-activities.html

Examples of combining commands

By piping the results of the commands in the right order, you can let the scripts do the hard work.

./getposturls.sh `sh ./getblogid.sh https://your.blogger.blog.example/` | ./store_comments_frame.sh
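
export_blog.sh chains the full pipeline in the same way (excerpted from the script below; $1 is the blog URL and $domain its sanitised domain name):

./getposturls.sh `sh ./getblogid.sh "$1"` | ./store_comments_frame.sh | ./get_activity_ids_from_comments_frame.sh | ./get_comments_from_google_plus_api_by_activity_id.sh > "./data/output/$domain/html/all-activities.html"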

Notice

This set of scripts was coded over the past couple of days, since the January 30th announcement from Google made it clear that Blogger owners were quickly running out of time. As such, it has not been rigorously tested yet.

I have been able to make archives of 2 blogs so far, which seem to have been successful as of Feb 3rd.

Related Tools

Get contact data details for a profile id

You just need to pass the (numeric) profile id to the contact_data_for_profile_id.sh script:

./contact_data_for_profile_id.sh 123456

Passing the ID for a Custom Profile URL (e.g. +YonatanZunger for https://plus.google.com/+YonatanZunger) should also work:

./contact_data_for_profile_id.sh +YonatanZunger

Even passing the URL should work:

./contact_data_for_profile_id.sh https://plus.google.com/112064652966583500522

If you have a list of userIDs or profile URLs stored in memberslist.txt, with each ID on a separate line, you can use xargs to pass all of these to the script. For instance, with 3 requests running in parallel, and deleting the target JSON file if a retrieval error occurs:

rm logs/failed-profile-retrievals.txt; cat memberslist.txt | xargs -L 1 -P 3 -I __UID__ ./contact_data_for_profile_id.sh __UID__ --delete-target

Or leave the JSON output files intact when a retrieval error occurs, so you can debug it:

rm logs/failed-profile-retrievals.txt; cat memberslist.txt | xargs -L 1 -P 3 ./contact_data_for_profile_id.sh

Thanks

  • Michael Prescott for having a need for this script, and for bouncing ideas back and forth with me.
  • Abdelghafour Elkaaba, for giving some ideas on how to import this back into Blogger's native comments
  • Edward Morbius, for moderating the Google+ Mass Migration community on Google+
  • Peggy K, for signal boosting and being a soundboard

Relevant Links

You can follow related discussions here:

#!/usr/bin/env bash
# encoding: utf-8
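# ==== _functions.sh: shared helper functions sourced by the other scripts ====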
#FIXME: move this to a variables.env file
REQUEST_THROTTLE="${REQUEST_THROTTLE:-0}"
function debug() {
if [ "$DEBUG" == "1" -o "$DEBUG" == "true" -o "$DEBUG" == "TRUE" ]; then
echo -e "$@" 1>&2
fi
}
function machine_type() { #function provided by paxdiablo at https://stackoverflow.com/a/3466183
unameOut="$(uname -s)"
case "${unameOut}" in
Linux*) machine=Linux;;
Darwin*) machine=macOS;;
CYGWIN*) machine=Cygwin;;
MINGW*) machine=MinGw;;
*) machine="UNKNOWN:${unameOut}"
esac
echo "${machine}"
}
#FIXME: Should probably make this tool-independent and just run it once in an initial setup step.
function ensure_gnutools() {
gnu_utils=()
missing_gnu_utils=()
install_suggestion=""
if [ "$(machine_type)" == "macOS" ]; then
install_suggestion="$(machine_type)"' comes with the BSD versions of many of the CLI utilities this script uses. Unfortunately these are often limited in their usage options. I would suggest installing the GNU versions through Homebrew (https://brew.sh), which the script should automatically detect as Homebrew prefixes them with "g". E.g.: `brew install gawk findutils gnu-sed grep coreutils`'
fi
if [[ $(gnufind "--version") == *"GNU findutils"* ]]; then
gnu_utils+=('find')
elif [[ $(man $(gnufind_string) | head -2 | grep BSD) == *"BSD General Commands Manual"* ]]; then
debug 'You have the BSD version of `find` installed. This script relies on the GNU version. Please install it with `brew install findutils`'
missing_gnu_utils+=('find')
else
missing_gnu_utils+=('find')
fi
if [[ $(gnused "--version") == *"GNU sed"* ]]; then
gnu_utils+=('sed')
elif [[ $(man $(gnused_string) | head -2 | grep BSD) == *"BSD General Commands Manual"* ]]; then
debug 'You have the BSD version of `sed` installed. This script relies on the GNU version. Please install it with `brew install gnu-sed`'
missing_gnu_utils+=('sed')
else
missing_gnu_utils+=('sed')
fi
if [[ $(gnugrep "--version") == *"GNU grep"* ]]; then
gnu_utils+=('grep')
elif [[ $(gnugrep "--version") == *"BSD grep"* ]]; then
debug 'You have the BSD version of `grep` installed. This script relies on the GNU version. Please install it with `brew install grep`'
missing_gnu_utils+=('grep')
else
missing_gnu_utils+=('grep')
fi
if [[ $(gnudate "--version") == *"GNU coreutils"* ]]; then
gnu_utils+=('date')
elif [[ $(man $(gnudate_string) | head -2 | grep BSD) == *"BSD General Commands Manual"* ]]; then
debug 'You have the BSD version of `date` installed. This script relies on the GNU version. Please install it with `brew install coreutils`'
missing_gnu_utils+=('date')
else
missing_gnu_utils+=('date')
fi
if [ "${gnu_utils[*]}" != "" ]; then
debug "We've found GNU versions of the following utils: $( IFS=$', '; echo "${gnu_utils[*]}" )"
fi
if [ "${missing_gnu_utils[*]}" != "" ]; then
echo -e "You are missing the GNU versions of the following utils: $( IFS=$', '; echo "${missing_gnu_utils[*]}" )\n$install_suggestion" 1>&2
exit 255
fi
}
function gnused_string() {
if hash gsed 2>/dev/null; then
echo 'gsed -E'
else
echo 'sed -E'
fi
}
function gnudate_string() {
if hash gdate 2>/dev/null; then
echo 'LC_ALL=en_GB gdate'
else
echo 'LC_ALL=en_GB date'
fi
}
function gnugrep_string() {
if hash ggrep 2>/dev/null; then
echo 'ggrep -E'
else
echo 'grep -E'
fi
}
function gnufind_string() {
if hash gfind 2>/dev/null; then
echo 'gfind'
else
echo 'find'
fi
}
function gnused() {
if hash gsed 2>/dev/null; then
debug "gnused(): gsed -E \"$@\""
gsed -E "$@"
else
debug "gnused(): sed -E \"$@\""
sed -E "$@"
fi
}
function gnugrep() {
if hash ggrep 2>/dev/null; then
debug "gnugrep(): ggrep -E \"$@\""
ggrep "$@"
else
debug "gnugrep(): grep -E \"$@\""
grep "$@"
fi
}
function gnudate() { # Taken from https://stackoverflow.com/a/677212 by @lhunath and @Cory-Klein
#FIXME: find out how I can prevent the loss of the quotes around the format in the debug output
debug "gnudate(): $(gnudate_string) $@"
if hash gdate 2>/dev/null; then
gdate "$@"
else
date "$@"
fi
}
function gnufind() {
debug "gnufind(): $(gnufind_string) $@"
if hash gfind 2>/dev/null; then
gfind "$@"
else
find "$@"
fi
}
function sanitise_filename() {
debug "sanitising filename $@"
gnused 's/[^-a-zA-Z0-9_.]/-/g'
}
function domain_from_url() {
debug "Retrieving domain from URL $1: echo \"$1\" | $(gnused_string) 's/https?:\/\/([^/]+)\/.+/\1/g')"
domain="$(echo "$1" | gnused 's/https?:\/\/([^/]+)\/.+/\1/g')"
debug "Domain: $domain"
echo "$domain"
}
function path_from_url() {
debug "Retrieving path from URL $1: echo \"$1\" | $(gnused_string) 's/https?:\/\/([^/]+)\/(.+)$/\2/g')"
path="$(echo "$1" | gnused 's/https?:\/\/([^/]+)\/(.+)$/\2/g')"
debug "Path: $path"
echo "$path"
}
function ensure_path() {
debug "ensure_path called with: $@ "
if [ -z "$1" -o "$1" == "" ]; then
echo "ensure_path called with an undefined path \$1" 1>&2
exit 255
elif [ -z "$2" -o "$2" == "" ]; then
echo "ensure_path called with an undefined filename \$2" 1>&2
exit 255
else
mkdir -p "$1"
echo "$1/$2"
fi
}
function ensure_jq() {
if ! hash jq 2>/dev/null; then
echo 'This command requires the `jq` commandline utility, but unfortunately we could not find it installed on your system. Please get it through your system package manager, or from https://github.com/stedolan/jq/' 1>&2 && return 255
fi
}
function ensure_blogger_api() {
if [ -z "$BLOGGER_APIKEY" -o "$BLOGGER_APIKEY" == "" ]; then
echo "This command requires access to the Blogger API, but ENVironment variable BLOGGER_APIKEY is not set. Please set it to your Blogger API v3 API key." 1>&2
exit 255
fi
}
function ensure_gplus_api() {
if [ -z "$GPLUS_APIKEY" -o "$GPLUS_APIKEY" == "" ]; then
echo "This command requires access to the Google+ API via an API key, but ENVironment variable GPLUS_APIKEY is not set. Please set it to your Google Plus API key." 1>&2
exit 255
fi
}
function check_help() {
if [ -n "$1" -a "$1" == "--help" ]; then
if [ -z "$2" -o "$2" == "" ]; then
echo -e "Usage: $(basename "$0")\nUsage definition undefined" 1>&2
exit 255
fi
echo -e "$2"
exit 0
fi
}
function timestamp() {
function_usage=("Usage: timestamp(\"\$format\" \"\$additional_date_arguments\")")
function_usage+=("Supports a number of format shorthands, as well as custom format.")
function_usage+=("Examples:")
function_usage+=("timestamp \"day\" # $(DEBUG=0 gnudate +"%Y-%m-%d")")
function_usage+=("timestamp \"week\" # $(DEBUG=0 gnudate +"%Y-%W")")
function_usage+=("timestamp \"month\" # $(DEBUG=0 gnudate +"%Y-%m")")
function_usage+=("timestamp \"year\" # $(DEBUG=0 gnudate +"%Y")")
function_usage+=("timestamp \"iso-8601\" # $(DEBUG=0 gnudate --iso-8601)")
function_usage+=("timestamp \"iso-8601=seconds\" # $(DEBUG=0 gnudate --iso-8601=seconds)")
function_usage+=("timestamp \"rfc-3339\" # $(DEBUG=0 gnudate --rfc-3339=seconds)")
function_usage+=("timestamp \"rfc-email\" # or \"rfc-5322\" # $(DEBUG=0 gnudate --rfc-email)")
function_usage+=("timestamp \"rss\" # $(DEBUG=0 gnudate "+\"%a, %d %b %Y %H:%M:%S %z\"")")
function_usage+=("timestamp \"%H:%M:%S, %a %d-%m-%y\" -u -d '2019-02-03 18:23:01' # $(DEBUG=0 gnudate +"%H:%M:%S, %a %d-%m-%y" -u -d '2019-02-03 18:23:01')")
function_usage=$( IFS=$'\n'; echo "${function_usage[*]}" )
date_arguments=()
if [ -z "$1" ]; then
echo -e "timestamp() called without arguments.\n$function_usage" 1>&2 && return 255
elif [ "$1" == "day" ]; then
ts_format="%Y-%m-%d"
elif [ "$1" == "week" ]; then
ts_format="%Y-%W"
elif [ "$1" == "month" ]; then
ts_format="%Y-%m"
elif [ "$1" == "year" ]; then
ts_format="%Y"
elif [ "$1" == "iso-8601" ]; then
date_arguments+=("--$1")
ts_format=""
elif [ "$1" == "iso-8601-seconds" -o "$1" == "iso-8601=seconds" ]; then
date_arguments=("--iso-8601=seconds")
ts_format=""
elif [ "$1" == "rfc-3339" ]; then
date_arguments=("--$1=seconds")
ts_format=""
elif [ "$1" == "rfc-5322" -o "$1" == "rfc-email" ]; then
date_arguments=("--rfc-email")
ts_format=""
elif [ "$1" == "rss" -o "$1" == "rfc-822" ]; then # per https://groups.yahoo.com/neo/groups/rss-public/conversations/topics/536
ts_format="%a, %d %b %Y %H:%M:%S %z"
else
ts_format="$1"
fi
shift 1
# echo "ts_format: '$ts_format'"
# echo "date_arguments: '$date_arguments'"
if [ "$ts_format" == "" ]; then
gnudate "$date_arguments$@"
else
debug "gnudate(): $(gnudate_string) $date_arguments$@" +"\"$ts_format\""
DEBUG=0 gnudate "$date_arguments$@" +"$ts_format"
fi
}
# FIXME: replace calls to this with the more generic version
function timestamp_date() {
timestamp "%y-%m-%d" #FIXME: should probably just use 'day' instead of this double-digit year format.
}
function activity_file() {
activity_id="$1"
if [ "$activity_id" == "" ]; then
echo "activity_file() called with an undefined activity_id \$1" 1>&2
exit 255
else
activity_filepath="$(ensure_path "./data/gplus/activities" "$activity_id.json")"
debug "Filepath for Activity Resource $activity_id: $activity_filepath"
echo "$activity_filepath"
fi
}
function get_user_id() {
uid_regex="^([0-9]+|\+[a-zA-Z0-9_-]+)$"
user_id="$1"
if [[ "$user_id" =~ $uid_regex ]]; then
echo "$user_id"
return
fi
user_id="${user_id/#https:\/\/plus.google.com\/u\/?\//}"
user_id="${user_id/#https:\/\/plus.google.com\//}"
if [[ "$user_id" =~ $uid_regex ]]; then
echo "$user_id"
return
else
echo "Unrecognised user id $user_id ($1)" 1>&2
return 255
fi
}
function user_profile_file() {
debug "user_profile_file() called with: '$1' '$2' '$3' "
user_id="$1"
timestamp_format="$2"
shift 2
timestamp_args=("$@")
if [ -z "$timestamp_format" -o "$timestamp_format" == "" ]; then
suffix=""
else
if [ -n "$timestamp_args" ]; then
debug "timestamp_args: ${timestamp_args[@]}"
else
timestamp_args=('-u')
fi
debug "timestamp_args: ${timestamp_args[@]}"
suffix=".$(timestamp "$timestamp_format" ${timestamp_args[@]})"
fi
if [ "$user_id" == "" ]; then
echo "user_profile_filepath() called with an undefined user_id \$1" 1>&2
return 255
elif [[ "$user_id" == '*' || "$user_id" == 'all_users' ]]; then
echo "$(ensure_path "./data/gplus/users" "${user_id}${suffix}.json")"
return
fi
user_id="$(get_user_id "$user_id")"
return_code=$?
if (( $return_code >= 1 )); then
echo "Please supply the user id (\$1) in their numeric, +PrefixedCustomURLName form, or profile URL form." 1>&2 && exit 255
fi
if [ -n "$user_id" ]; then
user_profile_filepath="$(ensure_path "./data/gplus/users" "${user_id}${suffix}.json")"
debug "Filepath for GPlus People resource with ID $user_id: $user_profile_filepath"
echo "$user_profile_filepath"
else
echo "user_profile_filepath(): Please supply the user id (\$1) in their numeric form, or the +PrefixedCustomURLName form." 1>&2
exit 255
fi
}
function comments_file() {
activity_id="$1"
if [ "$activity_id" == "" ]; then
echo "comments_file() called with an undefined activity_id \$1" 1>&2
exit 255
else
comments_filepath="$(ensure_path "./data/gplus/activities/$activity_id" "comments.json")"
debug "Filepath for Comments Resource List for Activity with id $activity_id: $comments_filepath"
echo "$comments_filepath"
fi
}
function api_url() {
api_url_usage="Usage: api_url(\"\$api_name\" \"\$api_endpoint\" \"\$api_endpoint_action\" \$api_arguments)\nExamples:\n"
api_url_usage="${api_url_usage}api_url(\"gplus\" \"people\" \"get\" \$user_id)\n"
if [ -z "$1" ]; then
echo -e "api_url() called without arguments.\n$api_url_usage" 1>&2 && return 255
elif [ "$1" == "gplus" ]; then #https://developers.google.com/+/web/api/rest/index
gplus_api_url="https://www.googleapis.com/plus/v1"
if [ -z "$2" ]; then
echo -e "api_url(\"$1\") needs more arguments.\n$api_url_usage" 1>&2 && return 255
elif [ "$2" == "people" ]; then #https://developers.google.com/+/web/api/rest/latest/people
gplus_api_url="${gplus_api_url}/people"
if [ -z "$3" ]; then
echo -e "api_url(\"$1\" \"$3\") needs more arguments.\n$api_url_usage" 1>&2 && return 255
elif [ "$3" == "get" ]; then #https://developers.google.com/+/web/api/rest/latest/people/get
if [ -z "$4" ]; then
echo -e "api_url(\"$1\" \"$3\" \"\$user_id\") is missing its \$user_id.\n$api_url_usage" 1>&2 && return 255
elif [[ "$4" =~ ^([0-9]+$|^\+[a-zA-Z0-9_-]+)$ ]]; then
echo "$gplus_api_url/$4?key=$GPLUS_APIKEY"
else
echo -e "api_url(\"$1\" \"$3\" \"\$user_id\") \$user_id needs to be a numeric id, or the +PrefixedCustomURLName; '$4' was given.\n$api_url_usage" 1>&2 && return 255
fi
else
echo -e "api_url(\"$1\" \"$2\" \"$api_endpoint_action\") called with an unknown API endpoint action '$3'. $api_url_usage" 1>&2 && return 255
fi
else
echo -e "api_url(\"$1\" \"$api_endpoint\") called with an unknown API endpoint '$2'. $api_url_usage" 1>&2 && return 255
fi
else
echo -e "api_url(\"\$api_name\") called with an unknown API name '$1'. $api_url_usage" 1>&2 && return 255
fi
}
function cache_remote_document_to_file() { # $1=url, $2=local_file, $3=curl_args
function_usage="Usage: cache_external_document_to_file(\"\$url\" \"\$local_filepath\")\n"
if [ -z "$1" ]; then
echo -e "cache_external_document_to_file() called without arguments.\n$api_url_usage" 1>&2 && return 255
elif [[ "$1" =~ (^https?|ftps?):// ]]; then
if [ -z "$2" ]; then
echo -e "cache_external_document_to_file(\"$1\") needs more arguments.\n$function_usage" 1>&2 && return 255
elif [ ! -f "$2" ]; then
if [ -z "$3" ]; then
curl_args=""
else
curl_args="$3 "
fi
debug "cache_external_document_to_file(): Retrieving JSON from $1 and storing it at $2"
status_code="$(curl --write-out %{http_code} --silent --output ${curl_args}"$2" "$1")"
if [ "$status_code" -ne 200 ]; then
echo "cache_external_document_to_file(\"$1\" \"$2\" \"$3\"): Error while retrieving remote document. Status code returned: $status_code" 1>&2 && return 255
else
echo "$2"
fi
else
debug "Cache hit for ${1}: $2"
echo "$2"
fi
else
echo -e "cache_external_document_to_file(): unsupported protocol for \$url ('$1'); only http(s) and ftp(s) are currently supported.\n$function_usage" 1>&2 && return 255
fi
}
function merge_json_files() {
debug "merge_json_files(): Looking in '$1' for files matching case-insensitive filemask '$2'"
gnufind "$1" -iname "$2" -exec cat {} + | jq -s '.' > "$3"
}
#!/usr/bin/env bash
# encoding: utf-8
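# ==== contact_data_for_profile_id.sh: retrieve contact-data JSON for a Google+ profile id ====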
source "_functions.sh"
ensure_gplus_api||exit 255
ensure_gnutools||exit 255
LOG_DIR="./logs"
FAILED_FILES_LOGFILE="failed-profile-retrievals.txt"
usage="Usage: $0 \$profile_id [--delete-target]\nThe optional --delete-target flag will delete the JSON output file for the given Profile if an error occurs; without it, it will log the filepath to '$LOG_DIR/$FAILED_FILES_LOGFILE' and leave the JSON output file intact instead."
if [ -z "$1" ]; then
echo "Please supply a user id as \$1" 1>&2
exit 255
elif [ "$1" == '--help' -o "$1" == '-h' ]; then
echo -e "$usage"
exit 255
fi
user_id="$(get_user_id "$1")"
if (( $? >= 1 )); then
echo "Please supply the user id (\$1) in their numeric, +PrefixedCustomURLName form, or profile URL form." 1>&2 && exit 255
fi
function handle_failure() {
echo "$json_output_file" >> $(ensure_path "$LOG_DIR" "$FAILED_FILES_LOGFILE")
debug "handle_failure(): \$1: '$1'"
if [ -n "$1" -a "$1" == '--delete-target' ]; then
debug "removing output file"
rm "$json_output_file"
fi
}
json_output_file="$(user_profile_file "$user_id" "day")" || exit 255
profile_api_url="$(api_url "gplus" "people" "get" "$user_id")" || exit 255
#filename=$(cache_remote_document_to_file "$profile_api_url" "$json_output_file") && cat "$filename" || handle_failure
filename=$(cache_remote_document_to_file "$profile_api_url" "$json_output_file")
if (( $? >= 1 )); then
# FIXME: find out why I can't exit with an error code, as it seems to make xargs running in parallel mode stop working when it encounters an error on one of its processes.
handle_failure "$2" # && exit 255
fi
echo "$filename"
#!/usr/bin/env bash
# encoding: utf-8
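# ==== export_blog.sh: one-shot driver that archives a complete Blogger blog ====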
source _functions.sh
domain=$(echo "$1" | gnused 's/https?:\/\/([^/]+)\/?.*/\1/g' | sanitise_filename )
echo "Exporting Blogger blog at $domain"
mkdir -p "./data/output/$domain/html"
./getposturls.sh `sh ./getblogid.sh "$1"` | ./store_comments_frame.sh | ./get_activity_ids_from_comments_frame.sh | ./get_comments_from_google_plus_api_by_activity_id.sh > "./data/output/$domain/html/all-activities.html"
#FIXME: Make it so that you aren't basically repeating all these lookups, even though they are cached...
for filename in $(find "data/comments_frames/$domain/"* )
do
#FIXME: keep track of where you are, so you can abort, and continue again at a later time without having to restart.
echo "$filename"
echo $(basename "$filename")
echo "$filename" | ./get_activity_ids_from_comments_frame.sh | ./get_comments_from_google_plus_api_by_activity_id.sh > "data/output/$domain/html/$(basename "$filename")"
done
#!/usr/bin/env bash
# encoding: utf-8
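# ==== get_activity_ids_from_comments_frame.sh: list Activity IDs found in stored comments-widget files ====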
source _functions.sh
stdin=$(cat)
while IFS= read -r blog_comments_widget_file
do
# Look for an ActivityID in the Google+ Comments widget file, trim the trailing double quote, and apply a unique sort
results="$(gnugrep -oP '^,"\K([a-z0-9]{22,})"' "$blog_comments_widget_file" | tr -d '"' | sort -u)"
debug "Found the following ActivityIDs in $blog_comments_widget_file: $results"
#FIXME: `continue` when $results is empty.
echo -e "$results"
done <<< "$stdin"
# < "${1:-/dev/stdin}"
#!/usr/bin/env bash
# encoding: utf-8
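# ==== get_comments_from_google_plus_api_by_activity_id.sh: fetch Activities and Comments, emit HTML ====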
source "_functions.sh"
usage="usage: echo -e \$activity_id1\n\$activity_id2\n\$activity_idn | $(basename "$0")\nExample: TODO (get_activity_ids_from_comments_frame.sh)"
check_help "$1" "$usage" || exit 255
stdin=$(cat)
while IFS= read -r activity_id
do
# TODO: Add an activity counter; combined with size messages in other scripts that could give a rough estimate of how many items still left to go
if [ -z "$activity_id" -o "$activity_id" == "" ]; then
debug "Missing activity id: $activity_id"
continue
fi
debug "\nLooking up activities and comments for Activity with ID: $activity_id"
get_activity_api_url="https://www.googleapis.com/plus/v1/activities/$activity_id?key=$GPLUS_APIKEY"
debug "Activity.get API URL: $get_activity_api_url"
# TODO: make comments per-page count also configurable via ENV var. Not sure if the same one should be recycled.
# FIXME: comments should be ordered by published dates...
list_comments_api_url="https://www.googleapis.com/plus/v1/activities/$activity_id/comments?key=$GPLUS_APIKEY&maxResults=500&sortOrder=ascending"
debug "Comments.list API URL: $list_comments_api_url"
response_file="$(activity_file $activity_id)"
if [ ! -f "$response_file" ]; then
debug "Retrieving JSON from $get_activity_api_url and storing it at $response_file"
$(curl "$get_activity_api_url" > "$response_file")
else
debug "Cache hit for ${get_activity_api_url}: $response_file"
fi
#FIXME: HTML generation should probably be split off to a separate file, and possibly be done in a more suitable language.
title=$(jq -r ' .title' "$response_file")
displayName=$(jq -r ' .actor | .displayName' "$response_file")
authorPicture=$(jq -r ' .actor | .image | .url' "$response_file")
#TODO: store authorPictures locally (permanently)
permaLink=$(jq -r ' .url' "$response_file")
published=$(jq -r ' .published' "$response_file")
updated=$(jq -r ' .updated' "$response_file")
content=$(jq -r ' .object | .content' "$response_file")
plusOnesCount=$(jq -r ' .object | .plusoners | .totalItems ' "$response_file")
plusOnesLink=$(jq -r ' .object | .plusoners | .selfLink ' "$response_file")
resharesCount=$(jq -r ' .object | .resharers | .totalItems ' "$response_file")
resharesLink=$(jq -r ' .object | .resharers | .selfLink ' "$response_file")
commentsCount=$(jq -r ' .object | .replies | .totalItems ' "$response_file")
commentsLink=$(jq -r ' .object | .replies | .selfLink ' "$response_file")
#TODO: add Attachment data and also store that locally if possible
html="<!doctype html>\n"
html="$html"'<html class="no-js" lang=""><head><meta charset="utf-8"><meta http-equiv="x-ua-compatible" content="ie=edge">'"<title>Comments for ${activity_id}</title>"'<meta name="description" content=""><meta name="viewport" content="width=device-width, initial-scale=1"></head><body>'
html="${html}<article>"
if [ -z "$title" -o "$title" == "" -o "$title" == null ]; then
debug "Could not find .title item"
html="${html}<h1>No Google+ Comments Available for ${activity_id}</h1></article></body></html>"
echo -e "$html"
continue
else
debug "Found title: $title"
fi
html="${html}<h1>$title</h1>\n"
html="${html}<small class='authored'>Published by <span class='authorName'>$displayName</span><img class='authorPicture' src='$authorPicture' /> on <a href="$permaLink" class='publishedAt'>$published</a>\n"
html="${html} and updated on: <span class='updatedAt'>$updated</span></small><br />\n"
html="${html}<p>$content</p>\n"
html="${html}<div class='interaction'>"
html="${html}<span class='plusones'>+$plusOnesCount</span> <a href='$plusOnesLink'>plus-ones</a>"
#TODO: expand the list of plusoners
html="${html}<div class='reshares'>"
html="${html}<span class='reshareCount'>+$resharesCount</span> <a href='$resharesLink'>reshares</a>"
#TODO: expand the list of resharers
html="${html}</div>"
html="${html}<div class='comments'>"
html="${html}<span class='commentsCount'>$commentsCount</span> <a href='$commentsLink'>comments</a>"
if [ -n "$commentsCount" -a "$commentsCount" != "0" ]; then
debug "Activity with id $activity_id has $commentsCount comments:"
comments_file_path="$(comments_file $activity_id)"
if [ ! -f "$comments_file_path" ]; then
debug "Querying Comments.list API for Activity with ID $activity_id at $list_comments_api_url and storing to $comments_file_path"
$(curl "$list_comments_api_url" > "$comments_file_path")
else
debug "Cache hit for Comments.list for Activity with ID $activity_id: $comments_file_path"
fi
for row in $(cat "$(comments_file $activity_id)" | jq -r '.items[] | @base64'); do
# TODO: Add a counter: "Processing comment [1/xx]"
_jq() {
echo ${row} | base64 --decode | jq -r "$@"
}
html="${html}<div class='comment'>\n"
commentTitle=$(_jq '.title')
if [ -n "$commentTitle" -a "$commentTitle" == "" ]; then
debug "Found comment title: $commentTitle"
html="${html}<h3>$commentTitle</h3>\n"
fi
html="${html}<small class='authored'>Published by <span class='authorName'>$(_jq '.actor .displayName')</span><img class='authorPicture' src='$(_jq ' .actor | .image | .url')' /> on <a href="$(_jq '.url')" class='publishedAt'>$(_jq '.published')</a>\n"
html="${html} and updated on: <span class='updatedAt'>$(_jq '.updated')</span></small><br />\n"
html="${html}<p>$(_jq '.object | .content')</p>\n"
html="${html}<div class='interaction'>"
html="${html}<span class='plusones'>$(_jq '.plusoners | .totalItems ')</span> <a href='$(_jq '.object | .plusoners | .selfLink ')'>plus-ones</a>"
html="${html}</div>\n"
html="${html}</div>\n"
done
else
debug "Activity with id $activity_id has no (${commentsCount}) comments."
fi
html="${html}</div>"
html="${html}</div>"
html="${html}</article></body></html>\n"
echo -e "$html"
done <<< $stdin
#!/usr/bin/env bash
# encoding: utf-8
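# ==== getblogid.sh: look up the Blog#id for a blog URL ====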
source "_functions.sh"
usage="usage: $(basename "$0") https://your.blogger.blog.example"
check_help "$1" "$usage" || exit 255
ensure_blogger_api||exit 255
if [ -z "$1" -o "$1" == "" ]; then
echo -e "$usage" 1>&2
exit 255
else
blog_url="$1"
fi
api_url="https://www.googleapis.com/blogger/v3/blogs/byurl?url=${blog_url}&key=${BLOGGER_APIKEY}"
domain=$(domain_from_url "$blog_url" | sanitise_filename)
path=$(ensure_path "data/blog_ids" "${domain}.txt")
if [ ! -f "$path" ]; then
curl -s "$api_url" | jq -r '.id' > "$path"
fi
cat "$path"
#!/usr/bin/env bash
# encoding: utf-8
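# ==== getposturls.sh: list all post URLs for a blog id ====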
source "_functions.sh"
usage="usage: $(basename "$0") \$blog_id\nExample: $(basename "$0") 12345\nOr: $(basename "$0") \"\$(./getblogid.sh https://your.blogger.blog.example)\""
check_help "$1" "$usage" || exit 255
ensure_blogger_api||exit 255
if [ -z "$1" -o "$1" == "" ]; then
echo -e "$usage" 1>&2
exit 255
else
blog_id="$1"
fi
PER_PAGE="${PER_PAGE:-500}"
api_url="https://www.googleapis.com/blogger/v3/blogs/${blog_id}/posts?key=${BLOGGER_APIKEY}&fetchBodies=false&fetchImages=false&status=live&maxResults=${PER_PAGE}"
function buildResponseFilename {
blog_id="$1"
pathSuffix="$2"
timestamp="$3"
extension="$4"
echo "${blog_id}${pathSuffix}-${timestamp}.${extension}"
}
function getResponsePath {
pageToken="$1"
debug "\$pageToken=${pageToken}"
pageTokenParam=""
pathSuffix="-${PER_PAGE}"
if [ -n "$pageToken" -a "$pageToken" != "" ]; then
pageTokenParam="&pageToken=$1"
pathSuffix="${pathSuffix}-${pageToken}"
fi
url="$api_url$pageTokenParam"
debug "API url to call: $url"
filename="$(buildResponseFilename "$blog_id" "$pathSuffix" "$(timestamp_date)" "json")"
path="$(ensure_path "data/blog_post_urls" "$filename")"
if [ ! -f "$path" ]; then
debug "Retrieving $url and caching the JSON at $path"
curl "$url" > "$path"
else
debug "Local cached copy already present: $path"
fi
echo "$path"
}
function getPageToken {
reponsePath="$1"
if [ -f "$responsePath" ]; then
debug "Looking for next pageToken: getPageToken(\$responsePath=${reponsePath}) { cat "$responsePath" | jq -r '.nextPageToken' }"
cat "$responsePath" | jq -r '.nextPageToken'
fi
}
# Requesting initial page:
responsePath=$(getResponsePath)
while :
do
# Check if there is another page; if you crawled the last page previously, then there won't be an .items item.
items=$(cat "$responsePath" | jq -rc '.items')
if [ -z "$items" -o "$items" == null -o "$items" == "" ]; then
break
fi
debug "Found the following URLs:"
cat "$responsePath" | jq -rc '.items[] | .url' || (echo "${0}: Error while retrieving urls for $responsePath" 1>&2 && exit 255)
# Look up the next pageToken from the JSON result
pageToken=$(getPageToken "$responsePath")
debug "Next pageToken: $pageToken"
if [ -z "$pageToken" -o "$pageToken" == null -o "$pageToken" == "" ]; then
break
fi
# Retrieve the next page of JSON results
responsePath=$(getResponsePath "$pageToken")
# Optional throttling in seconds
sleep $REQUEST_THROTTLE
done
aggregatePath="data/blog_post_urls/$(buildResponseFilename "$blog_id" "-${PER_PAGE}" "$(timestamp_date)" "txt")"
cat "data/blog_post_urls/${blog_id}-${PER_PAGE}"*"-$(timestamp_date).json" | jq -rc '.items[] | .url' > "$aggregatePath" || (echo "${0}: Error while saving all post urls for $aggregatePath for blog with id ${blog_id} with ${PER_PAGE} items per page, cached on $(timestamp_date)" 1>&2 && exit 255)
debug "Saved all post urls to: $aggregatePath"
#!/usr/bin/env bash
# encoding: utf-8
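# ==== store_comments_frame.sh: store the G+ Comments widget response for each post URL ====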
api_url="https://apis.google.com/u/0/_/widget/render/comments?first_party_property=BLOGGER&query="
stdin=$(cat)
source ./_functions.sh
while IFS= read -r post_url
do
domain=$(domain_from_url "$post_url" | sanitise_filename)
url="${api_url}${post_url}"
debug "Looking for G+ Comments for $post_url"
filename="$(path_from_url "$post_url" | sanitise_filename)"
debug "Filename: $filename"
path="$(ensure_path "data/comments_frames/$domain" "$filename")"
if [ ! -f "$path" ]; then
debug "Storing comments for $post_url to $path and sleeping for $REQUEST_THROTTLE seconds."
sleep $REQUEST_THROTTLE
curl "$url" > "$path"
else
debug "Cache hit: G+ Comments Widget has already been retrieved from $url: to $path"
fi
echo "$path"
done <<< $stdin

tech234a commented Feb 9, 2019

@FiXato Looks cool! A few of us are encountering the error ./getblogid.sh: line 3: source: _functions.sh: file not found when running the complete command you listed near the top of the README. What distro are you using? Any suggestions?
