@amaltaro
Last active August 19, 2019 13:03
Data structure for the MS Transferor document
# OPTION A:
{"wf_A": {"timestamp": 0000,
          "primary": ["list of transfer ids"],
          "secondary": ["list of transfer ids"]},
 "wf_B": {"timestamp": 0000,
          "primary": [],
          "secondary": []}
}
# OPTION B:
{"wf_A": {"timestamp": 0000,
          "primary": {"dset_1": ["list of transfer ids"]},
          "secondary": {"PU_dset_1": ["list of transfer ids"]}},
 "wf_B": {"timestamp": 0000,
          "primary": {"dset_1": ["list of transfer ids"],
                      "parent_dset_1": ["list of transfer ids"]},
          "secondary": {"PU_dset_1": ["list of transfer ids"],
                        "PU_dset_2": ["list of transfer ids"]}},
 "wf_C": {"timestamp": 0000,
          "primary": {},
          "secondary": {}}
}
# OPTION C (the chosen one!) - it assumes we store all the transfer information within the same Couch document:
{"wf_A": [{"timestamp": 000, "dataset": "/a/b/c", "dataType": "primary", "transferIDs": [1,2,3]},
          {"timestamp": 000, "dataset": "/a/b/c", "dataType": "secondary", "transferIDs": [4]}],
 "wf_B": [{"timestamp": 000, "dataset": "/a/b/c", "dataType": "primary", "transferIDs": [1,2,3]},
          {"timestamp": 000, "dataset": "/a/b/c", "dataType": "parent", "transferIDs": [4,5,6]}],
 "wf_C": []
}
# OPTION D - it assumes a new document is created for every request:
{"workflowName": "blah",
"lastUpdate": 000, # just as timestamp above
"transfers": [{"dataset":"/a/b/c", "dataType": "primary", "transferIDs": [1,2,3], "campaignName": "blah2017", "completion": [0.0]},
{"dataset":"/a/b/c", "dataType": "secondary", "transferIDs": [4], "campaignName": "blah2018", "completion": [0.0]},
{"dataset":"/a/b/c", "dataType": "parent", "transferIDs": [4,5,6], "campaignName": "blah2017", "completion": [0.0]}]
}
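A minimal Python sketch of how an Option D document could be built; the function and argument names (`make_transfer_doc`, `transfers`) are illustrative only, not part of WMCore:

```python
import time


def make_transfer_doc(workflow_name, transfers):
    """Build an Option D record: one document per workflow.

    `transfers` is a list of dicts with the keys dataset, dataType,
    transferIDs, campaignName and completion, as in the example above.
    This helper is a hypothetical sketch, not an existing WMCore API.
    """
    return {"workflowName": workflow_name,
            "lastUpdate": int(time.time()),  # cast to int, per the discussion below
            "transfers": transfers}


doc = make_transfer_doc("blah", [
    {"dataset": "/a/b/c", "dataType": "primary",
     "transferIDs": [1, 2, 3], "campaignName": "blah2017",
     "completion": [0.0]}])
print(doc["workflowName"])  # → blah
```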
amaltaro commented May 27, 2019

@vkuznet this last option looks reasonable. I added it to the list of formats as option C.
Just to clarify the fields:

  • timestamp: (integer) timestamp for when this request + dataset was acted on (created or updated)
  • dataset: (string) the dataset name (block names are not supported and must be mapped to their dataset name)
  • dataType: (string) the dataset type; one of the following values: primary | secondary | parent
  • transferIDs: (list of integers) the transfer identifiers

As a general rule, a workflow can have the following data:
0-1 primary dataset, 0-1 parent dataset, 0-N secondary dataset (normally up to 2)
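The cardinality rule above can be sketched as a small Python check; `check_data_counts` is a hypothetical helper name, not an existing function:

```python
def check_data_counts(transfers):
    """Return True if a workflow's transfer records respect the rule:
    0-1 primary dataset, 0-1 parent dataset, 0-N secondary datasets.
    Illustrative sketch only, not a WMCore API."""
    counts = {"primary": 0, "parent": 0, "secondary": 0}
    for rec in transfers:
        counts[rec["dataType"]] += 1
    return counts["primary"] <= 1 and counts["parent"] <= 1


transfers = [
    {"dataset": "/a/b/c", "dataType": "primary", "transferIDs": [1, 2, 3]},
    {"dataset": "/x/y/z", "dataType": "secondary", "transferIDs": [4]},
    {"dataset": "/x/y/w", "dataType": "secondary", "transferIDs": [5]},
]
print(check_data_counts(transfers))  # → True (two secondaries are allowed)
```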

vkuznet commented May 27, 2019

In general, timestamp should be a float, since we'll use time.time() and it returns a float. But of course we can cast it to an int. Everything else is correct.

amaltaro commented:
@vkuznet Valentin, I created the Option D for the case where we want to store a new document for each workflow. I believe that's going to be our best option TBH.

vkuznet commented Jul 25, 2019

Alan, your option D is almost identical to my original proposal (the difference being that I proposed a record per dataset, while you group them per workflow). It is a good compromise: it represents a single entity (in this case, a workflow) and we do not need to compose one gigantic dictionary.

amaltaro commented:
Ok, let's hope nothing else changes. Let's proceed with option D then, one record/document per workflow.

amaltaro commented:
@vkuznet Valentin, I added a completion parameter to option D, so that we can persist the transfer completion every time it gets calculated.
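A minimal sketch of how the completion field could be appended to in an Option D document; `update_completion` is a hypothetical helper, not an existing WMCore function:

```python
import time


def update_completion(doc, dataset, fraction):
    """Append a newly calculated completion fraction to the matching
    transfer record and refresh lastUpdate. Illustrative sketch only."""
    for rec in doc["transfers"]:
        if rec["dataset"] == dataset:
            rec["completion"].append(fraction)
    doc["lastUpdate"] = int(time.time())
    return doc


doc = {"workflowName": "blah",
       "lastUpdate": 0,
       "transfers": [{"dataset": "/a/b/c", "dataType": "primary",
                      "transferIDs": [1, 2, 3], "campaignName": "blah2017",
                      "completion": [0.0]}]}
update_completion(doc, "/a/b/c", 0.25)
print(doc["transfers"][0]["completion"])  # → [0.0, 0.25]
```

Keeping completion as a list (rather than a single number) preserves the history of each calculation, which matches the `"completion": [0.0]` structure in option D above.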
