Prepare datasets

Zero files

#!/bin/bash
set -euo pipefail

# Create 1,000,000 empty (zero-byte) files.
mkdir zero
for i in {1..1000000}; do
  touch "zero/$i.data"
done

3k files

#!/bin/bash
set -euo pipefail

# Create 1,000,000 sparse files of 3 KiB each.
mkdir 3k
for i in {1..1000000}; do
  truncate -s 3k "3k/$i.data"
done

Others

Follow the gomnia example to generate file sizes with a particular distribution and total size. Once completed, feed the generated sizes to the following script on stdin, one size per line:

#!/bin/bash
set -euo pipefail

mkdir 1TB

# Read one file size per line from stdin and create a sparse file of that size.
i=0
while read -r size; do
  truncate -s "$size" "1TB/$i.data"
  ((i+=1))
done
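
For example, assuming the script above is saved as make-1tb and the generated sizes are in sizes.txt, one size per line (both filenames are illustrative):

./make-1tb < sizes.txt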

Install rclone with the storj patch

Download for Linux:

https://github.com/calebcase/rclone/releases/tag/v1.50.2-362-g28d7db32-feature-storj-beta

Or you can build it yourself:

git clone https://github.com/calebcase/rclone
cd rclone
git checkout feature/storj
go build
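
Either way, assuming the resulting binary is named rclone and sits in the current directory, you can confirm it runs and reports the patched version:

./rclone version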

Configure rclone

Run rclone config and follow the interactive prompts. You will need a scope (access grant) from uplink setup or uplink share.

For example, my config for the atlanta cluster contains something like:

[atlanta]
type = storj
scope = supersecretscope
skip-peer-ca-whitelist = true
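
As a quick check that the remote registered, you can list the configured remotes; atlanta: should appear in the output:

rclone listremotes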

Make target bucket

rclone mkdir atlanta:test
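
You can verify the bucket was created by listing the buckets on the remote:

rclone lsd atlanta: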

Upload

Create an upload script with the following:

#!/bin/bash
set -euo pipefail

site=${1?site name}
dataset=${2?path to dataset}
attempt=${3?attempt number}
concurrency=${4:-64}

# Copy the dataset with $concurrency parallel transfers, appending all
# rclone output to a per-site, per-dataset, per-attempt log file.
date -u
time rclone --transfers "$concurrency" -v \
  copy "$dataset" "$site:test/$dataset.$attempt" &>> "$site.$dataset.$attempt.log"

This will copy the local directory $dataset to the backend $site. Invoke it like this:

./upload atlanta zero 1

You should end up with a local log file atlanta.zero.1.log.
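
To run the full suite against one site, you can loop over the datasets (shown here for attempt 1):

for dataset in zero 3k 1TB; do
  ./upload atlanta "$dataset" 1
done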

Errors

Check the upload for errors as it progresses. In particular, we are interested in timeout events.

Create an errors script with the following:

#!/bin/bash
set -euo pipefail

site=${1?site name}
dataset=${2?path to dataset}
attempt=${3?attempt number}

(
  printf 'Now: %s\n' "$(date -u --iso=s)"

  # All errors, ignoring the 'already closed' ones.
  general=$(
    (grep ERROR "$site.$dataset.$attempt.log" || true) |
      (grep -v 'already closed' || true) |
      wc -l
  )
  printf 'General: %d\n' "$general"

  # Just the copy timeout errors.
  timeouts=$(
    (grep ERROR "$site.$dataset.$attempt.log" || true) |
      (grep -v 'already closed' || true) |
      (grep 'timed out waiting on copy' || true) |
      wc -l
  )
  printf 'Timeouts: %d\n' "$timeouts"
) | column -t

# Bucket the timeout errors by minute: the log timestamps are local
# (CET here), so tag them with the zone and convert to UTC.
(grep ERROR "$site.$dataset.$attempt.log" || true) |
  (grep -v 'already closed' || true) |
  (grep 'timed out waiting on copy' || true) |
  awk '{print $1 " " $2 " CET"}' |
  xargs -I{} date -u --iso=m -d {} |
  uniq -c

Invoke errors like this:

./errors atlanta 3k 1

You should see output like this:

Now:       2020-01-21T12:16:23+00:00
General:   0
Timeouts:  0
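
To keep an eye on an in-progress upload, one option is to rerun the script periodically with watch:

watch -n 60 ./errors atlanta 3k 1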

Listings

Create a listings script with the following:

#!/bin/bash
set -euo pipefail

site=${1?site name}
dataset=${2?path to dataset}
attempt=${3?attempt number}

printf 'Recursive Listing\n'
date -u
time rclone ls "$site:test/$dataset.$attempt" | wc -l

printf '\nNon-recursive Listing\n'
date -u
time rclone lsf "$site:test/$dataset.$attempt" | wc -l

Invoke listings like this:

./listings atlanta 3k 1

You should see output like this:

Recursive Listing
Tue 21 Jan 2020 11:45:11 AM UTC
87744

real    1m23.694s
user    0m8.545s
sys     0m1.133s

Non-recursive Listing
Tue 21 Jan 2020 11:46:35 AM UTC
90945

real    0m58.549s
user    0m8.117s
sys     0m0.456s
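
As a sanity check, you can compare the remote object count and total size against the local dataset; rclone size reports both:

rclone size atlanta:test/3k.1
find 3k -type f | wc -l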
