```python
import json
import re

import requests

repo = '5hju4'  # my example repository, update as appropriate
query = '?filter[name][contains]=confounds_regressors.tsv'  # my example query, update or leave blank as appropriate
url = 'https://api.osf.io/v2/nodes/{0}/files/osfstorage/{1}'.format(repo, query)

guids = []
while True:
    resp = requests.get(url)
    resp.raise_for_status()
    data = json.loads(resp.content)
    for i in data['data']:
        sub = re.search(r'sub-(\S+)_task', i['attributes']['name']).group(1)
        guids.append((sub, i['id']))
    # results are paginated; follow the 'next' link until it runs out
    url = data['links']['next']
    if url is None:
        break
```
Hi @nicofarr!

It looks like you currently have Dropbox, GitLab, and OSF Storage as providers in that repository. Just to be clear, the code here will only pull from OSF Storage! If that's the behavior you're looking for, the second thing to note is that while your OSF Storage is structured so that all the data files sit in subfolders, the code here assumes a flat directory structure. That means we'll need to add another layer on top of updating the query. So, I'd suggest the following three immediate changes:
- Update `repo = 'h285u'`
- Set `query = ''`
- Change `sub = re.search(r'sub-(\S+)_task', i['attributes']['name']).group(1)` to `sub = i['attributes']['name']`
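Folding those three changes into the original loop, a sketch might look like this (wrapped in a function for reuse; the function name is mine, not part of the original snippet):

```python
import json

import requests

def list_top_level(repo):
    """Return (name, id) pairs for everything at the root of the node's
    osfstorage: the original loop with the three changes applied
    (new repo id, empty query, raw name instead of the regex)."""
    query = ''  # empty: no filename filter, so folders come back too
    url = 'https://api.osf.io/v2/nodes/{0}/files/osfstorage/{1}'.format(
        repo, query)
    guids = []
    while url is not None:
        resp = requests.get(url)
        resp.raise_for_status()
        data = json.loads(resp.content)
        for i in data['data']:
            # folder names don't match the sub-*_task pattern, so keep them as-is
            guids.append((i['attributes']['name'], i['id']))
        url = data['links']['next']  # follow pagination until exhausted
    return guids

# guids = list_top_level('h285u')
```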
This should then return the hashes for each of the individual folders (along with the folder names) in `guids`. Of note, it only seems to return 10 folder names for me. Are you expecting more to have data?
Then you'd need to query those hashes to get the associated contents. Let me know if that's in line with what you're looking for!
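Querying one of those folder hashes can reuse the same pagination pattern; a minimal sketch, assuming the folder listing lives at `/v2/nodes/<node>/files/osfstorage/<folder_id>/` (the same endpoint as above, with the folder's id appended) and that the helper name is mine:

```python
import json

import requests

def list_folder(node, folder_id):
    """Yield (name, id) pairs for every entry in one osfstorage folder,
    following the paginated 'next' links just like the top-level listing."""
    url = 'https://api.osf.io/v2/nodes/{0}/files/osfstorage/{1}/'.format(
        node, folder_id)
    while url is not None:
        resp = requests.get(url)
        resp.raise_for_status()
        data = json.loads(resp.content)
        for entry in data['data']:
            yield entry['attributes']['name'], entry['id']
        url = data['links']['next']

# e.g. contents = {sub: list(list_folder('h285u', fid)) for sub, fid in guids}
```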
Hi @emdupre,

Thanks, after making those changes I do get 384 guids, corresponding to our folders.

Now, how do you query the hashes to get the contents of each folder?

By the way, where did you find this info? Is it from the API guide on the OSF website? We're having a hard time extracting the relevant info from that doc...
Hi @emdupre,

I'm trying to reuse this for another codebase. We have a (large) OSF repo (https://osf.io/h285u/ ) and we'd like to pull all the keys to fill a dictionary, which we can later filter to download either all or a subset of the files.

In some of the nilearn code (nilearn/datasets/func.py) it seems like an index file (associating URLs with hashes) was built at some point, and is also downloaded from OSF (https://github.com/nilearn/nilearn/blob/master/nilearn/datasets/func.py#L847 ), before being parsed to extract just the hash part and forge the URL to fetch only the necessary files (https://github.com/nilearn/nilearn/blob/master/nilearn/datasets/func.py#L904).

The question is: how do you build the index (url : hash pairs)? I suppose the code here does it, but I'm not sure how to format the query.
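One way to build such an index is to walk the osfstorage tree recursively, collecting each file's download link. This is a sketch, not the nilearn approach: it assumes folders are marked by `attributes['kind'] == 'folder'` and that each file entry carries its link under `links['download']`, which matched the OSF v2 API responses I've seen but is worth double-checking against a live node.

```python
import json

import requests

def build_index(node, folder_id=''):
    """Walk an osfstorage tree and return {relative_path: download_url}."""
    index = {}
    url = 'https://api.osf.io/v2/nodes/{0}/files/osfstorage/{1}'.format(
        node, folder_id)
    while url is not None:
        resp = requests.get(url)
        resp.raise_for_status()
        data = json.loads(resp.content)
        for entry in data['data']:
            name = entry['attributes']['name']
            if entry['attributes']['kind'] == 'folder':
                # recurse into the subfolder, prefixing its name to each key
                sub = build_index(node, entry['id'] + '/')
                index.update({name + '/' + k: v for k, v in sub.items()})
            else:
                index[name] = entry['links']['download']
        url = data['links']['next']
    return index

# e.g. index = build_index('h285u'); then filter index.keys() and
# requests.get() only the URLs you need
```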