Download Zeppelin Notebooks from S3 and organise into folders
import boto3
from pprint import pprint
import json
from pathlib import Path

region = 'ap-southeast-2'
s3client = boto3.client('s3', region_name=region)
paginator = s3client.get_paginator('list_objects')
bucket = 'my-bucket-name'
prefix = 'user/notebook/'


def callmelog(num_bytes: int) -> None:
    # Progress callback: boto3 calls this periodically with a byte count
    pprint(num_bytes)


# Delimiter='/' makes S3 return one CommonPrefix per notebook ID "directory"
for result in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
    # 'CommonPrefixes' is missing from pages that contain no sub-prefixes
    for common in result.get('CommonPrefixes', []):
        commonprefix = common.get('Prefix')
        # e.g. 'user/notebook/2CVRRT2WC/' -> '2CVRRT2WC' (index 2 assumes a two-level prefix)
        name = commonprefix.split('/')[2]
        key = commonprefix + 'note.json'
        s3client.download_file(Bucket=bucket,
                               Key=key,
                               Filename=name + '.json',
                               Callback=callmelog)

p = Path('.')
documents = [document for document in p.iterdir() if document.suffix == '.json']
for document in documents:
    pprint(document)
    notebook_contents = json.loads(document.read_bytes())
    # The notebook name may contain '/' to express Zeppelin's folder hierarchy
    notebook_name = (notebook_contents.get('name') + '.json')
    notebook_path = Path(document.parent, notebook_name)
    # I wish that pathlib would let you create/rename a file with parent directories
    # This next line seems to handle it explicitly OK though
    notebook_path.parents[0].mkdir(parents=True, exist_ok=True)
    pprint(notebook_path)
    document.rename(notebook_path)

# Second script: write each paragraph's title and text from one notebook to a text file
from pathlib import Path
import json

notebook_path = Path('./notebook_to_be_extracted.json')
extracted_path = Path('./extracted_notebook.txt')
log_queries = json.loads(notebook_path.read_bytes())
with extracted_path.open('w') as extf:
    # Default to an empty list so a notebook without paragraphs does not raise
    for paragraph in log_queries.get('paragraphs', []):
        title = paragraph.get('title', '')
        text = paragraph.get('text', '')
        extf.write(title + '\n')
        extf.write('---------------------------' + '\n')
        extf.write(text + '\n')

@davoscollective (owner, author) commented Jun 26, 2018

I'd set up Zeppelin on AWS EMR with a configuration that stores/backs up all of its notebooks on S3. Each notebook is stored in a directory named after the notebook ID (a 9-character string, e.g. 2CVRRT2WC), and inside each directory is a "note.json". The actual name of the notebook, including the pseudo-folder hierarchy that Zeppelin allows, is stored as a value inside the JSON file. That makes it hard to retrieve code from the backup when you have a stack of files all named "note.json".

The first script pulls the files from S3, reconstructs the Zeppelin hierarchy as local filesystem subdirectories, and renames the JSON files using the actual notebook name. The other script extracts the formatted code paragraphs into a text file. I did this to help with migrating from Zeppelin on EMR to Databricks notebooks.
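
To make the name-to-path mapping concrete, here is a minimal sketch of how a notebook name carrying a pseudo-folder hierarchy turns into a local path. The sample dictionary and the `local_path_for` helper are hypothetical and exist only for illustration; the `name`, `paragraphs`, `title` and `text` keys are the ones the scripts above read, and a real note.json holds more than this.

```python
from pathlib import Path

# Hypothetical, trimmed-down note.json contents (illustration only).
sample_note = {
    "name": "etl/daily/load_orders",
    "paragraphs": [{"title": "Load", "text": "%sql select 1"}],
}


def local_path_for(note: dict, root: Path) -> Path:
    """Map a notebook's name to root/<folders>/<name>.json."""
    # Slashes in the name become subdirectories when joined onto root.
    return root / (note["name"] + ".json")


target = local_path_for(sample_note, Path("notebooks"))
print(target.as_posix())  # notebooks/etl/daily/load_orders.json
```

The parent directories still have to be created before the rename, which is what the `mkdir(parents=True, exist_ok=True)` call in the first script does.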

Caveats: Some of the notebooks had leading or trailing spaces in their names, and names could also contain other characters that are illegal in file names and need cleaning.
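
One way to handle that cleaning is sketched below. It is not part of the original scripts: `clean_notebook_name` is a hypothetical helper, and the set of characters treated as illegal is an assumption based on Windows file name rules, so adjust it for your own filesystem.

```python
import re

# Assumed set of illegal characters (Windows file name rules plus control chars).
_ILLEGAL = re.compile(r'[<>:"\\|?*\x00-\x1f]')


def clean_notebook_name(name: str) -> str:
    """Return a notebook name that is safer to use as a relative path."""
    parts = []
    for part in name.split('/'):
        # Trim surrounding spaces, then replace illegal characters with '_'.
        part = _ILLEGAL.sub('_', part.strip())
        # Drop empty segments (leading or doubled slashes) and '.' / '..'.
        if part and part not in ('.', '..'):
            parts.append(part)
    # Fall back to a placeholder if nothing usable is left.
    return '/'.join(parts) or 'untitled'


print(clean_notebook_name(' /etl / load: orders '))  # etl/load_ orders
```

It could be applied to `notebook_contents.get('name')` before the `.json` suffix is appended in the renaming loop. Dropping empty leading segments also stops a name that begins with `/` from being treated as an absolute path.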
