{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using rclone to \"extract\" Backblaze Zip Snapshots and Reupload to B2\n",
"\n",
"This is a ~~guide~~ demonstration of how I use rclone to expand the contents of a Backblaze snapshot on B2 into another B2 bucket.\n",
"\n",
"## Questions and Answers\n",
"\n",
"### Who is this for?\n",
"\n",
"Me! Seriously, I wrote this for my own recall/notes in the future but I thought I'd share it\n",
"\n",
"To really answer the question, this is for people who want to do something similar and can use this as a guide. It is not a \"tool\" per se. It is not designed to be an easy or user-friendly process.\n",
"\n",
"I use Python to do it on a VPS. Python is super readable so it should be easy enough to (lightly) customize if you don't know Python. I would say this demonstration is for people who are willing to play around and learn it. It is *not* turn-key.\n",
"\n",
"### Can I use Windows?\n",
"\n",
"No idea! I am using a Debian VPS and my restore was from a macOS backup. I suspect any tool would work\n",
"\n",
"### What software do you need\n",
"\n",
"You need rclone and FUSE so you can rclone mount. This is not a guide on either of those. \n",
"\n",
"I am also assuming you've **already** set up rclone with B2 and/or an additional remote.\n",
"\n",
"I also use the awsome `tqdm` library but you can ignore that if you don't want it.\n",
"\n",
"### Will this cost money\n",
"\n",
"Yes! You will be downloading from B2 so you pay egress. It is also *very inneficient*. I have no idea how bad but I'd imagine it isn't great! So expect to pay more egress than you're actually using.\n",
"\n",
"### Why do this vs downloading the entire zip file?\n",
"\n",
"My test restore is small but my main use is for a 200+gb restore. I want to use my VPS's bandwidth but my VPS is small! (~10gb free). So while I am paying more for egress than, say, downloading the restore (and especially more than if I were to request a USB drive or download right from Backblaze Personal), it saves me the bandwidth.\n",
"\n",
"### What is this for\n",
"\n",
"Besides just putting it into its own B2 bucket, this process is useful to seed a different backup tool (including rclone, but really any)\n",
"\n",
"### Can I filter it\n",
"\n",
"Yes! There are two places. The first and best is to filter which `files` you include. The second is with rclone filters but I do *not* suggest that as you waste the time and expense to extract the files.\n",
"\n",
"### This could be done better\n",
"\n",
"I bet! Please share. I like learning new things. This is just what I worked out!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os,sys\n",
"import shutil\n",
"import time\n",
"import subprocess\n",
"import operator\n",
"import signal\n",
"from pathlib import Path\n",
"from zipfile import ZipFile\n",
"\n",
"from tqdm import tqdm # This is 3rd party. $ python -m pip install tqdm"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rclone v1.53.3\n",
"- os/arch: linux/amd64\n",
"- go version: go1.15.5\n",
"\n",
"3.8.3 (default, Jul 2 2020, 16:21:59)\n",
"[GCC 7.3.0]",
"\n"
]
}
],
"source": [
"print(subprocess.check_output(['rclone','version']).decode())\n",
"print(sys.version)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mount the restore bucket\n",
"\n",
"Here we mount the restore bucket. Note, do **not** add any caching unless you have the scratch space. Since my restore is bigger than my free space, I do not! This is basically a super vanilla rclone mount. In fact, when I tested with different advanced options, it failed.\n",
"\n",
"There are two ways to do this. The first is to use a new terminal and create the mount there. That works fine but I will instead do it all within Python and `subprocess`. With subprcocess, the arguments are passed as a list. This is actually really great since you do not have to deal with escaping. And it's easier to comment! If you do run it on a seperate terminal, `screen` is your friend."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"mountdir = Path('~/mount').expanduser()\n",
"mountdir.mkdir(exist_ok=True)\n",
"\n",
"rclone_remote = 'b2:b2-snapshots-7f7799daad93/' # already set up B2. Found the bucket with `rclone lsf b2:`\n",
"restore_zip = 'bzsnapshot_2020-12-17-07-06-19.zip' # found with `rclone lsf b2:b2-snapshots-7f7799daad93/`"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"cmd = ['rclone',\n",
" '-vv', # Optional but may be useful later\n",
" 'mount',rclone_remote,str(mountdir),\n",
" '--read-only',]\n",
"stdout,stderr = open('stdout','wb'),open('stderr','wb') # writable in bytes mode. I usually use context managers but I will need this to stay open\n",
"mount_proc = subprocess.Popen(cmd,stdout=stdout,stderr=stderr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Make sure it mounted. This is optional"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Waiting for mount \n",
"......mounted\n"
]
}
],
"source": [
"print('Waiting for mount ',flush=True)\n",
"for ii in range(10):\n",
" if os.path.ismount(mountdir):\n",
" break\n",
" if mount_proc.poll() is not None:\n",
" raise ValueError('did not mount')\n",
" time.sleep(1)\n",
" print('.',end='',flush=True)\n",
"else:\n",
" print('ERROR: Mount did not activate. Kill proc and exiting',file=sys.stderr,flush=True)\n",
" mount_proc.kill()\n",
" sys.exit(2)\n",
"print('mounted')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Browse the Zip\n",
"\n",
"Python's `zipfile` will **not** read the entire file in order to get a listing or even some random file inside. Don't believe me? See the bottom!\n",
"\n",
"What we need to do now is get a list of the files and use manual inspection to decide what to cut. Backblaze uses the full path"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"with ZipFile(mountdir/restore_zip) as zf:\n",
" files = zf.infolist() # could also do namelist() but we will want the sizes later"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2149"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(files)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pick a random file to get the path. We will use this later"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Macintosh HD/Users/jwinkMAC/PyFiSync/Papers/Sorted/2010/2018/melchers2018structural.pdf'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"files[1000].filename"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Identify and save the prefix as you want it removed"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"restore_prefix = 'Macintosh HD/Users/jwinkMAC/' # We will need this later to reupload. This s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Restore a single file!\n",
"\n",
"This is actually super easy! Just search though `files` to find the file you want. Let's assume it is the 1000th file still"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"restore_file = files[1000]\n",
"\n",
"restore_dir = Path('~/restore').expanduser()\n",
"with ZipFile(mountdir/restore_zip) as zf:\n",
" zf.extract(restore_file,path=str(restore_dir))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inside the zip file is the full prefixed file (from root). I don't want that"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PosixPath('/home/jwink3101/restore/PyFiSync/Papers/Sorted/2010/2018/melchers2018structural.pdf')"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Optional. Remove prefix\n",
"src = restore_dir / restore_file.filename\n",
"dst = restore_dir / os.path.relpath(src,restore_dir / restore_prefix)\n",
"dst.parent.mkdir(parents=True,exist_ok=True)\n",
"shutil.move(src,dst)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extract and Upload\n",
"\n",
"Now, this could almost *certainly* use improvement. We will do the following:\n",
"\n",
"- Gather files up to the max batch size. Then for each batch:\n",
"- Delete the restore directory\n",
"- Restore the batched files\n",
"- Do an rclone `copy` (*not* `sync`) to push those files\n",
" - Need to make the source at the `restore_prefix` so we do not keep that junk\n",
" \n",
"Note that we may be able to optimize this by better backfilling the batches but I am not sure if there is any advantages with sequential reading so I will go one file after the other. It may be moot."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Tool to gather the files into batches\n",
"def group_to_size(seq,maxsize,key=None):\n",
" \"\"\"\n",
" Group seq by size up to but not to exceed \n",
" maxsize (unless a single item does)\n",
" \n",
" Example:\n",
" >>> list(group_to_size([10,20,10,90,40,50,99,2,101,0,30,90,11],100))\n",
" [(10, 20, 10), (90,), (40, 50), (99,), (2,), (101,), (0, 30), (90,), (11,)]\n",
" \n",
" \"\"\"\n",
" s = 0\n",
" curr = []\n",
" for item in seq:\n",
" s0 = key(item) if callable(key) else item\n",
" if s + s0 > maxsize: # Yield if will be pushed over\n",
" yield tuple(curr)\n",
" curr = []\n",
" s = 0\n",
" s += s0\n",
" curr.append(item)\n",
" if curr: \n",
" yield tuple(curr) # Anything remaining"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"maxsize = 512 * 1024 * 1024 # 512 mb or 536870912 bytes\n",
"\n",
"# dest_remote = 'b2:mynewbuckets/whatever'\n",
"dest_remote = '/home/jwink3101/restore/tmp/'\n",
"\n",
"scratch = Path('~/scratch').expanduser().absolute()\n",
"scratch.mkdir(parents=True,exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# This is there you can filter stuff\n",
"# filtered = (f for f in files if ...)\n",
"\n",
"filtered = files # No filter"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0%| | 0/185 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"batch 0 # files 185\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 185/185 [00:30<00:00, 6.16it/s]\n",
" 0%| | 0/146 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n",
"batch 1 # files 146\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 146/146 [00:32<00:00, 4.47it/s]\n",
" 0%| | 0/97 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n",
"batch 2 # files 97\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 97/97 [00:29<00:00, 3.24it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 1%| | 2/187 [00:00<00:12, 15.09it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"batch 3 # files 187\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 187/187 [00:31<00:00, 5.87it/s]\n",
" 0%| | 0/126 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n",
"batch 4 # files 126\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 126/126 [00:28<00:00, 4.39it/s]\n",
" 0%| | 0/141 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n",
"batch 5 # files 141\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 141/141 [00:27<00:00, 5.05it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0%| | 0/78 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"batch 6 # files 78\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 78/78 [00:15<00:00, 4.90it/s]\n",
" 0%| | 0/54 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n",
"batch 7 # files 54\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 54/54 [00:29<00:00, 1.84it/s]\n",
" 0%| | 0/138 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n",
"batch 8 # files 138\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 138/138 [00:29<00:00, 4.68it/s]\n",
" 0%| | 0/730 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n",
"batch 9 # files 730\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 730/730 [00:29<00:00, 24.90it/s]\n",
" 0%| | 0/101 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n",
"batch 10 # files 101\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 101/101 [00:30<00:00, 3.30it/s]\n",
" 0%| | 0/137 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n",
"batch 11 # files 137\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 137/137 [00:28<00:00, 4.84it/s]\n",
" 0%| | 0/29 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n",
"batch 12 # files 29\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 29/29 [00:11<00:00, 2.58it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"calling rclone\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"batches = group_to_size(filtered,maxsize,key=operator.attrgetter('file_size'))\n",
"with ZipFile(mountdir/restore_zip) as zf:\n",
" for ib,batchfiles in enumerate(batches):\n",
" print('batch',ib,'# files',len(batchfiles))\n",
" # Extract all of the files\n",
" for file in tqdm(batchfiles):\n",
" zf.extract(file,path=str(scratch))\n",
" \n",
" print('calling rclone')\n",
" \n",
" cmd = ['rclone',\n",
" 'move', # use move so they get deleted\n",
" str(scratch / restore_prefix), dest_remote,\n",
" '--transfers','20', # and/or other flags. all optional.\n",
" ]\n",
" subprocess.check_call(cmd)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Unmount"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"mount_proc.send_signal(signal.SIGINT)\n",
"mount_proc.wait() # Hopefully this works. Otherwise you may need to kill it manually\n",
"stdout.close()\n",
"stderr.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional Notes\n",
"\n",
"### ZipFile\n",
"\n",
"Python's ZipFile will read into a zip file without reading the entire file. It does need to \"seek\" in the file, hence the mount, but rclone handles that like a champ.\n",
"\n",
"How do I know I'm not downloading the entire file? Well, you could look at the rclone logs. The other way is to make a file-object that will be verbose about what's going on. Note that `ZipFile` takes either a filename *or* a file-like object"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"class VerboseFile(io.FileIO):\n",
" def read(self,*args,**kwargs):\n",
" print('read',*args,**kwargs)\n",
" r = super(VerboseFile,self).read(*args,**kwargs)\n",
" print(' len:',len(r))\n",
" return r\n",
" def seek(self,*args,**kwargs):\n",
" print('seek',*args,**kwargs)\n",
" return super(VerboseFile,self).seek(*args,**kwargs)\n",
" def close(self,*args,**kwargs):\n",
" print('close')\n",
" return super(VerboseFile,self).close(*args,**kwargs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, insetad of \n",
"\n",
"```python\n",
"with ZipFile(mountdir/restore_zip) as zf:\n",
" ...\n",
"```\n",
"do\n",
"```python\n",
"with ZipFile(VerboseFile(mountdir/restore_zip)) as zf:\n",
" ...\n",
"``` \n",
"and you'll be able to see everything"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}