- Run (often created/edited from SMRT Link RunDesign, stored as XML)
- CollectionMetadata a Run has a list of Collections (Primary Analysis will convert each CollectionMetadata into a SubreadSet)
- PacBio DataSets (SubreadSet, ReferenceSet, etc.) These are thin XML files that contain general metadata as well as pointers to 'external resources' (e.g., BAM or Fasta files) and their companion index files.
- SMRT Link Job A general (async) unit of work to perform operations on PacBio DataSets
- **DataStoreFile** a container for an output file from a SMRT Link Job; it contains metadata such as file type, size, and path. A list of DataStoreFiles is called a DataStore. This is the core output of a SMRT Link Job.
- **Report** a general model to capture metrics (also referred to as 'Attributes'), Report Tables, and Report Plot Groups. A Report is a specific type of DataStoreFile and is used to communicate details of a SMRT Link Job to the SMRT Link UI (and web services).
Second-tier models, such as Report View Rules or Pipeline View Rules, are not discussed here.
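The core models above can be summarized with a minimal sketch. The field names here are illustrative, not the actual SMRT Link schema:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataStoreFile:
    """A single output file of a SMRT Link Job (illustrative fields)."""
    uuid: str
    file_type_id: str  # e.g., "PacBio.FileTypes.JsonReport"
    path: str
    size_bytes: int


@dataclass
class DataStore:
    """The core output of a SMRT Link Job: a list of DataStoreFiles."""
    files: List[DataStoreFile] = field(default_factory=list)


@dataclass
class ReportAttribute:
    """A single named metric ('Attribute') in a Report."""
    id: str
    name: str
    value: float


@dataclass
class Report:
    """A specific DataStoreFile type carrying attributes, tables, plot groups."""
    uuid: str
    attributes: List[ReportAttribute] = field(default_factory=list)
    # tables and plot groups omitted for brevity
```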
ICS/PA takes a Run XML with a list of Collections and converts each CollectionMetadata into a SubreadSet. The SubreadSet is copied from the ICS/PA file system into customer storage on NFS (accessible by the companion SMRT Link instance), and the SubreadSet XML is imported into SMRT Link using the `import-dataset` Job type. The Reports for the SubreadSet XML emitted from the `import-dataset` job show up in RunQC as well as in DataManagement in SMRT Link.
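The handoff above can be sketched as a simple pipeline. The function names and paths are hypothetical placeholders, not the real ICS/PA or SMRT Link APIs:

```python
from typing import List


def primary_analysis(collection_metadata: str) -> str:
    """Hypothetical: convert a CollectionMetadata into a SubreadSet XML path."""
    return collection_metadata.replace(".metadata.xml", ".subreadset.xml")


def copy_to_nfs(subreadset_xml: str) -> str:
    """Hypothetical: copy the SubreadSet from ICS/PA storage to customer NFS."""
    return "/nfs/customer/" + subreadset_xml.rsplit("/", 1)[-1]


def process_run(collections: List[str]) -> List[str]:
    """One SubreadSet (and one import-dataset job) per Collection in the Run XML."""
    return [copy_to_nfs(primary_analysis(c)) for c in collections]
```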
Shown below is a sketch of the dataflow.
Simplified, the general interface of a SMRT Link Job, for DataSet type T: a Job takes T as input and produces a DataStore (T -> Job -> DataStore).
More generally: List of EntryPoint PB DataSets -> Job -> DataStore
A DataStore is a list of DataStoreFiles.
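In Python-flavored types, that interface could be written as follows. This is a sketch only; the real services speak REST/JSON, and the DataStore is simplified to a list of paths:

```python
from typing import Callable, List

# A DataStore is simplified here to a list of output file paths.
DataStore = List[str]

# The general Job interface: EntryPoint DataSet path(s) in, DataStore out.
Job = Callable[[List[str]], DataStore]


def import_dataset_job(entry_points: List[str]) -> DataStore:
    """Sketch of an import-dataset job: it emits the dataset(s) themselves
    plus a report JSON describing them (paths are illustrative)."""
    return entry_points + ["reports/import_report.json"]
```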
Each DataStoreFile can be a different file type, such as a PB DataSet, VCF, ReportJSON, or Fasta, and also contains the job id and UUID of the job that generated the DataStoreFile.
During and after SMRT Link Job execution, the DataStoreFiles are imported into the SMRT Link database. For a specific subset of file types (PB DataSet types), additional metadata is stored in the SMRT Link database. Each DataSet has metadata about its specific dataset type as well as metadata about a possible 'parent' DataSet. The DataSet 'parentage' can result from copying, merging, or analysis (the semantics are not consistent).
Each ReportJSON file contains a list of PB DataSet UUIDs in its data model. This is used to communicate which DataSets are specific to the input(s) of a given ReportJSON. Alternatively said, the EntryPoint PB DataSet(s) might not be directly used to compute the ReportJSON datastore file.
NOTE: the dotted arrow represents the relation between the Report and the source input for the task at the ReportJSON level. This is NOT captured at the SMRT Link Server level.
Accessing the Reports and the source DataSet is well defined here, since both depend only on the Job Id.
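Pulling the referenced DataSet UUIDs out of a ReportJSON might look like this. The `dataset_uuids` key follows the pbcommand-style report schema; treat the field name as an assumption:

```python
import json


def report_dataset_uuids(report_path: str) -> list:
    """Return the PB DataSet UUIDs referenced by a ReportJSON file."""
    with open(report_path) as f:
        report = json.load(f)
    # 'dataset_uuids' is assumed here; default to [] if the key is absent
    return report.get("dataset_uuids", [])
```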
I believe the Merge DataSet Job type is similar.
To perform a standard Resequencing Job, the user can run two different `import-dataset` SMRT Link Jobs, then a `pbsmrtpipe` (i.e., 'Analysis') SMRT Link Job can be performed.
Steps:
- Import SubreadSet
- Import ReferenceSet
- Run Analysis Job to run the Resequencing Analysis
(Each Job type is shown in its own box)
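The three steps can be sketched as chained job calls. The helper names are hypothetical; in practice each is a separate SMRT Link service job:

```python
def import_dataset(xml_path: str) -> str:
    """Hypothetical import-dataset job: returns the imported dataset id."""
    return "imported:" + xml_path


def run_analysis(pipeline_id: str, entry_points: list) -> list:
    """Hypothetical pbsmrtpipe job: returns its DataStore as a list of paths."""
    return [f"{pipeline_id}/{e}" for e in entry_points]


def resequencing(subreadset_xml: str, referenceset_xml: str) -> list:
    sset = import_dataset(subreadset_xml)    # Job 1: import SubreadSet
    rset = import_dataset(referenceset_xml)  # Job 2: import ReferenceSet
    return run_analysis("resequencing", [sset, rset])  # Job 3: analysis
```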
To demonstrate a larger dataflow example, consider the following case. A user would like to import SubreadSets alpha and beta, perform filtering on alpha, merge the filtered output with beta, perform a Resequencing analysis on the merged SubreadSet, and export the merged DataSet as a ZIP.
Steps:
- Import the ReferenceSet, SubreadSet alpha, and SubreadSet beta
- Create a filtered SubreadSet from SubreadSet alpha
- Create a merged SubreadSet from SubreadSet beta and the output of #2
- Create an Analysis Job using #3 and ReferenceSet from #1
- Create a DataSet XML(s) ZIP from the output of #3
This demonstrates the graph nature of the design and the composability of different SMRT Link Job types. Note that data provenance comes for free in this model.
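Because every DataStoreFile records the job that produced it, and each job records its input DataSets, provenance is a backwards walk over a DAG. A sketch using the five steps above (job and dataset names are illustrative):

```python
# Each job records its input and output dataset ids (illustrative values).
JOBS = {
    "import":   {"inputs": [],                         "outputs": ["alpha", "beta", "ref"]},
    "filter":   {"inputs": ["alpha"],                  "outputs": ["alpha.filtered"]},
    "merge":    {"inputs": ["beta", "alpha.filtered"], "outputs": ["merged"]},
    "analysis": {"inputs": ["merged", "ref"],          "outputs": ["consensus"]},
    "export":   {"inputs": ["merged"],                 "outputs": ["datasets.zip"]},
}


def provenance(dataset: str, jobs: dict) -> set:
    """Walk backwards from a dataset to every ancestor dataset."""
    producers = {out: job for job, rec in jobs.items() for out in rec["outputs"]}
    seen, stack = set(), [dataset]
    while stack:
        job = producers.get(stack.pop())
        for parent in (jobs[job]["inputs"] if job else []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

For example, `provenance("consensus", JOBS)` recovers every upstream dataset: the merged SubreadSet, the ReferenceSet, beta, the filtered alpha, and alpha itself.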