@amaltaro
Last active August 19, 2019 13:03
Data structure for the MS Transferor document
# OPTION A:
{"wf_A": {"timestamp": 0,
          "primary": ["list of transfer ids"],
          "secondary": ["list of transfer ids"]},
 "wf_B": {"timestamp": 0,
          "primary": [],
          "secondary": []}
}
# OPTION B:
{"wf_A": {"timestamp": 0,
          "primary": {"dset_1": ["list of transfer ids"]},
          "secondary": {"PU_dset_1": ["list of transfer ids"]}},
 "wf_B": {"timestamp": 0,
          "primary": {"dset_1": ["list of transfer ids"],
                      "parent_dset_1": ["list of transfer ids"]},
          "secondary": {"PU_dset_1": ["list of transfer ids"],
                        "PU_dset_2": ["list of transfer ids"]}},
 "wf_C": {"timestamp": 0,
          "primary": {},
          "secondary": {}}
}
# OPTION C (the chosen one!) - it assumes we store all the transfer information within the same Couch document:
{"wf_A": [{"timestamp": 0, "dataset": "/a/b/c", "dataType": "primary", "transferIDs": [1, 2, 3]},
          {"timestamp": 0, "dataset": "/a/b/c", "dataType": "secondary", "transferIDs": [4]}],
 "wf_B": [{"timestamp": 0, "dataset": "/a/b/c", "dataType": "primary", "transferIDs": [1, 2, 3]},
          {"timestamp": 0, "dataset": "/a/b/c", "dataType": "parent", "transferIDs": [4, 5, 6]}],
 "wf_C": []
}
# OPTION D - it assumes a new document is created for every request:
{"workflowName": "blah",
 "lastUpdate": 0,  # same as the timestamp above
 "transfers": [{"dataset": "/a/b/c", "dataType": "primary", "transferIDs": [1, 2, 3], "campaignName": "blah2017", "completion": [0.0]},
               {"dataset": "/a/b/c", "dataType": "secondary", "transferIDs": [4], "campaignName": "blah2018", "completion": [0.0]},
               {"dataset": "/a/b/c", "dataType": "parent", "transferIDs": [4, 5, 6], "campaignName": "blah2017", "completion": [0.0]}]
}
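A minimal Python sketch of how an Option D document could be assembled, one document per workflow. The helper name `make_transfer_doc` is hypothetical; only the field names come from the structure above.

```python
import time

def make_transfer_doc(workflow, transfers):
    """Build an Option D style document: one document per workflow.

    `transfers` is a list of per-dataset dicts with the keys shown above.
    (Hypothetical helper, not part of MS Transferor itself.)
    """
    return {
        "workflowName": workflow,
        "lastUpdate": int(time.time()),  # cast time.time() to whole seconds
        "transfers": transfers,
    }

doc = make_transfer_doc("blah", [
    {"dataset": "/a/b/c", "dataType": "primary",
     "transferIDs": [1, 2, 3], "campaignName": "blah2017",
     "completion": [0.0]},
])
```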
@vkuznet

vkuznet commented May 24, 2019

Alan, did you consider my proposal? If you still don't want to use separate records, I have another suggestion for how to simplify the use of a single record while keeping better flexibility to update and use it in code. How about this format:

{
  "wf_A": [record1, record2, ...],
  "wf_B": [....],
}

where individual records will be of the form

{"timestamp": 0, "dataset": "/a/b/c", "type": "primary", "transferIDs": [1, 2, 3]}

This way we can atomically update a single record in CouchDB, while using the individual records from this structure in the code. Since the number of datasets per workflow is small, the overhead of looping through them is negligible, and it gives a clear and consistent record representation to deal with in the code.
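The lookup Valentin describes amounts to a short loop over the per-dataset records. A sketch (the helper `transfer_ids` and the sample data are hypothetical; the record shape follows his proposal):

```python
# One document, keyed by workflow, each value a list of per-dataset records.
doc = {
    "wf_A": [
        {"timestamp": 0, "dataset": "/a/b/c", "type": "primary", "transferIDs": [1, 2, 3]},
        {"timestamp": 0, "dataset": "/x/y/z", "type": "secondary", "transferIDs": [4]},
    ],
}

def transfer_ids(doc, workflow, data_type):
    """Collect transfer IDs for a given workflow and dataset type.

    The record count per workflow is small, so a linear scan is fine.
    """
    return [rec["transferIDs"] for rec in doc.get(workflow, [])
            if rec["type"] == data_type]

print(transfer_ids(doc, "wf_A", "primary"))  # [[1, 2, 3]]
```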

@amaltaro

amaltaro commented May 27, 2019

@vkuznet this last option looks reasonable. I added it to the list of formats as option C.
Just to clarify the fields:

  • timestamp: (integer) timestamp for when this request + dataset was acted on (created/updated)
  • dataset: (string) the dataset name (block names are not supported and must be mapped to the dataset name)
  • dataType: (string) the dataset type; one of: primary | secondary | parent
  • transferIDs: (list of integers) list containing the transfer identifiers

As a general rule, a workflow can have the following data:
0-1 primary dataset, 0-1 parent dataset, 0-N secondary datasets (normally up to 2)
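That cardinality rule is easy to check in code. A hypothetical validation helper (not part of MS Transferor itself), assuming the `dataType` values listed above:

```python
def validate_transfers(transfers):
    """Check the rule above: at most one primary record, at most one
    parent record, and any number of secondary records."""
    counts = {"primary": 0, "secondary": 0, "parent": 0}
    for rec in transfers:
        dtype = rec["dataType"]
        if dtype not in counts:
            raise ValueError("unknown dataType: %s" % dtype)
        counts[dtype] += 1
    return counts["primary"] <= 1 and counts["parent"] <= 1

# One primary plus two secondary datasets is valid:
print(validate_transfers([
    {"dataType": "primary"},
    {"dataType": "secondary"},
    {"dataType": "secondary"},
]))  # True
```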

@vkuznet

vkuznet commented May 27, 2019

In general, the timestamp should be a float, since we'll use time.time(), which returns a float. But of course we can cast it to an int. Everything else is correct.
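A two-line illustration of the float-versus-int point:

```python
import time

ts = time.time()   # a float, with sub-second precision
ts_int = int(ts)   # truncated to whole seconds, if an integer is preferred
assert isinstance(ts, float) and isinstance(ts_int, int)
```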

@amaltaro

@vkuznet Valentin, I created Option D for the case where we store a new document for each workflow. I believe that's going to be our best option, TBH.

@vkuznet

vkuznet commented Jul 25, 2019

Alan, your option D is almost identical to my original proposal (the difference is that I proposed records per dataset, while you group them per workflow), and it is a good compromise: it represents a single entity (in this case, a workflow) and we do not need to compose a gigantic single dictionary.

@amaltaro

Ok, let's hope nothing else changes. Let's proceed with option D then, one record/document per workflow.

@amaltaro

@vkuznet Valentin, I added a completion parameter to option D, so that we can persist the transfer completion every time it gets calculated.
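A sketch of how that completion history could be appended to an Option D document. The helper `update_completion` is hypothetical; only the field layout comes from the gist above.

```python
def update_completion(doc, dataset, data_type, fraction):
    """Append a freshly calculated completion fraction to the matching
    transfer record in an Option D style document."""
    for rec in doc["transfers"]:
        if rec["dataset"] == dataset and rec["dataType"] == data_type:
            rec["completion"].append(fraction)
            return True
    return False  # no matching dataset/dataType record

doc = {"workflowName": "blah", "lastUpdate": 0,
       "transfers": [{"dataset": "/a/b/c", "dataType": "primary",
                      "transferIDs": [1, 2, 3], "campaignName": "blah2017",
                      "completion": [0.0]}]}
update_completion(doc, "/a/b/c", "primary", 0.5)
print(doc["transfers"][0]["completion"])  # [0.0, 0.5]
```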
