
@emdupre
Created April 1, 2019 16:59
Pull download keys for all files listed in an OSF repository
import json
import re
import requests
repo = '5hju4' # my example repository, update as appropriate
query = '?filter[name][contains]=confounds_regressors.tsv' # my example query, update or leave blank as appropriate
url = 'https://api.osf.io/v2/nodes/{0}/files/osfstorage/{1}'.format(repo, query)
guids = []
while True:
    resp = requests.get(url)
    resp.raise_for_status()
    data = json.loads(resp.content)
    for i in data['data']:
        # pull the subject label out of the BIDS-style file name
        sub = re.search(r'sub-(\S+)_task', i['attributes']['name']).group(1)
        guids.append((sub, i['id']))
    # follow pagination until there are no more pages of results
    url = data['links']['next']
    if url is None:
        break
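
As a usage note: the loop above only collects the (subject, key) pairs. Below is a minimal sketch of one way to then download the matched files; it assumes you also grab each entry's direct download URL from i['links']['download'] inside the loop (the OSF v2 files endpoint exposes one for file entities), and the output directory and file-naming scheme are purely illustrative.

import os

import requests

# hypothetical continuation: suppose the loop above also stored the download
# link for each file, i.e. guids.append((sub, i['id'], i['links']['download']))
out_dir = 'confounds'  # illustrative output directory
os.makedirs(out_dir, exist_ok=True)

for sub, key, download_url in guids:
    out_path = os.path.join(out_dir, 'sub-{0}_{1}.tsv'.format(sub, key))
    # stream each file to disk rather than holding it all in memory
    with requests.get(download_url, stream=True) as resp:
        resp.raise_for_status()
        with open(out_path, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
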
@nicofarr

Hi @emdupre,
I'm trying to reuse this for another codebase. We have a (large) OSF repo (https://osf.io/h285u/ ), and we'd like to pull all of the keys into a dictionary that we can later filter to download either all or a subset of the files.
In some of the nilearn code (nilearn/datasets/func.py) it seems like an index file (associating URLs with hashes) was built at some point and is also downloaded from OSF (https://github.com/nilearn/nilearn/blob/master/nilearn/datasets/func.py#L847 ), before being parsed to extract just the hash part and forge the URL to fetch only the necessary files (https://github.com/nilearn/nilearn/blob/master/nilearn/datasets/func.py#L904).

The question is: how do you build the index (url : hash pairs)? I suppose the code here does it, but I'm not sure how to format the query.

@emdupre
Author

emdupre commented Jul 15, 2020

Hi @nicofarr!

It looks like you currently have Dropbox, GitLab, and OSFStorage as providers in that repository. Just to be clear, the code here will only pull from OSFStorage! If that's the behaviour you're looking for, the second thing to note is that your OSFStorage is structured such that all of the data files live in subfolders, whereas the code here assumes a flat directory structure. That means we'll need to add another layer to this, in addition to updating the query. So, I'd suggest the following three immediate changes:

  1. Update repo = 'h285u'
  2. Set query = ''
  3. Change sub = re.search(r'sub-(\S+)_task', i['attributes']['name']).group(1) to sub = i['attributes']['name']

This should then return the hashes for each of the individual folders (along with the folder names) in guids. Of note, it only seems to return 10 folder names for me. Are you expecting more folders to have data?
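
Putting those three changes together, the adapted loop would look roughly like this (same paging logic as the gist above, just listing the top-level osfstorage folders rather than matching file names):

import json

import requests

repo = 'h285u'
url = 'https://api.osf.io/v2/nodes/{0}/files/osfstorage/'.format(repo)

guids = []
while True:
    resp = requests.get(url)
    resp.raise_for_status()
    data = json.loads(resp.content)
    for i in data['data']:
        # each entry at this level is a folder: keep its name and its hash (id)
        guids.append((i['attributes']['name'], i['id']))
    url = data['links']['next']
    if url is None:
        break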

Then you'd need to query those hashes to get the associated contents. Let me know if that's in line with what you're looking for!
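
A hedged sketch of that second step, assuming (do check this against the OSF API docs) that a folder's contents can be listed by appending its id to the node's osfstorage endpoint with a trailing slash, and continuing from the repo and guids defined above:

import json

import requests

contents = {}
for folder_name, folder_id in guids:
    url = 'https://api.osf.io/v2/nodes/{0}/files/osfstorage/{1}/'.format(repo, folder_id)
    # page through each folder's listing, just like the top-level loop
    while url is not None:
        resp = requests.get(url)
        resp.raise_for_status()
        data = json.loads(resp.content)
        for i in data['data']:
            contents.setdefault(folder_name, []).append(
                (i['attributes']['name'], i['id']))
        url = data['links']['next']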

@nicofarr

Hi @emdupre,
Thanks! After making those changes I do get 384 guids, corresponding to our folders.

Now, how do you query those hashes to get the contents of each folder?

BTW, where did you find this info? Is it from the API guide on the OSF website? We're having a hard time extracting the relevant information from that doc...
