bittremieux/qcml_remarks_telco20170818.md

## qcml_remarks_telco20170818.md

      
Some remarks on qcML XML schema version 0.0.10 and its associated handcrafted example file prior to the HUPO-PSI QC working group monthly teleconference of August 18, 2017.

## General schema remarks

### Reference to external files


The empty FileFormat element with a cvParam child to hold the actual value looks a bit odd to me. I saw that mzIdentML does it in a similar fashion, but I don't really see the need for it. The mzML approach, where this information is stored directly in a cvParam without nesting it in a (redundant?) parent element, seems more logical to me. Is there an advantage to having a FileFormat element even though it doesn't contain any information beyond its child cvParam?
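To make the comparison concrete, here is a minimal Python sketch that reads the file format from both layouts. The XML fragments are my own illustration, not copied from the qcML schema or from mzML: the `RawFile` wrapper, the attribute set, the accession shown, and the helper `file_format_accession` are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments for illustration only; element names, attributes
# and the accession are assumptions, not taken verbatim from any schema.
NESTED = """
<RawFile>
  <FileFormat>
    <cvParam cvRef="MS" accession="MS:1000584" name="mzML format"/>
  </FileFormat>
</RawFile>
"""

FLAT = """
<RawFile>
  <cvParam cvRef="MS" accession="MS:1000584" name="mzML format"/>
</RawFile>
"""

def file_format_accession(xml_text):
    """Return the accession of the first cvParam found at any depth."""
    root = ET.fromstring(xml_text)
    param = root.find(".//cvParam")
    return None if param is None else param.get("accession")

nested_accession = file_format_accession(NESTED)
flat_accession = file_format_accession(FLAT)
```

Both calls yield the same accession, which is the point of the question above: in this sketch the wrapper element carries no information of its own.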


I don't understand what ExternalFormatDocumentation should be used for. The description says:

A URI to access documentation and tools to interpret the external format of the ExternalData instance. For example, XML Schema or static libraries (APIs) to access binary formats.

This is rather vague: what kind of information should the URI point to? Should it describe APIs that can be used to process the corresponding file type, or APIs that have been used to process that particular file?


Which brings me to the next point...

### Provenance


We need more detailed information on the software used in the upstream pipeline to process the input files (recorded for each individual referenced file?), as well as on the software used to generate the qcML file itself.
This can already be done using cvParams, but I think it's a good idea to make this more explicit with dedicated elements, similar to how information on the specific input files is explicitly encoded.
Ideally this should also include all relevant parameter information for each tool to make the whole processing pipeline fully reproducible.


We still need to include information on the contact person for the qcML file. Again, I think this information is important enough to warrant a dedicated parent element that contains the necessary information (name, email address, organization, ...) as child cvParams.
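As a rough sketch of what such dedicated provenance elements could look like, the Python snippet below builds a small document with one software entry and one contact entry. Everything here is hypothetical: the element names `Software` and `Contact`, the `PLACEHOLDER:` accessions, the term names, and all values are made up for illustration and would need to be replaced by whatever the schema and CV end up defining.

```python
import xml.etree.ElementTree as ET

def add_cv_param(parent, accession, name, value):
    """Append a cvParam child carrying a single name/value pair."""
    return ET.SubElement(parent, "cvParam",
                         accession=accession, name=name, value=value)

# All element names, accessions and values below are placeholders.
root = ET.Element("qcML")

# One element per tool, with its parameters stored as child cvParams.
software = ET.SubElement(root, "Software", id="sw_1")
add_cv_param(software, "PLACEHOLDER:0001", "software name", "example-tool")
add_cv_param(software, "PLACEHOLDER:0002", "software version", "1.0")
add_cv_param(software, "PLACEHOLDER:0003", "example parameter", "10")

# One element for the contact person of the qcML file.
contact = ET.SubElement(root, "Contact", id="contact_1")
add_cv_param(contact, "PLACEHOLDER:0004", "contact name", "Jane Doe")
add_cv_param(contact, "PLACEHOLDER:0005", "contact email", "jane@example.org")
add_cv_param(contact, "PLACEHOLDER:0006", "contact organization", "Example Lab")

xml_text = ET.tostring(root, encoding="unicode")
```

The cvParams stay the carriers of the actual values; the parent elements only make it obvious where to look for software and contact information.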


### Other


I'm not sure whether the content of a QualityMetric should be a CVParamType. This seems a bit too complex to me. Instead, I think it would be useful to encode more explicitly that it will contain JSON data, i.e. by retaining only the attributes needed to fully describe the JSON data and omitting the remaining cvParam attributes.
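A minimal sketch of how a consumer might read such a slimmed-down metric, assuming the JSON is stored as the element's text content. The fragment is hypothetical: the attributes kept (`accession`, `name`), the placeholder accession, and the JSON layout are assumptions, not the current schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical element that keeps only the attributes needed to
# describe the JSON payload; not the current schema definition.
FRAGMENT = """
<QualityMetric accession="PLACEHOLDER:0001" name="example metric">
  {"values": [0.25, 0.5, 0.75, 1.0]}
</QualityMetric>
"""

def read_metric(xml_text):
    """Return (accession, name, decoded JSON payload) for one metric."""
    element = ET.fromstring(xml_text)
    payload = json.loads(element.text)  # surrounding whitespace is ignored
    return element.get("accession"), element.get("name"), payload

accession, name, payload = read_metric(FRAGMENT)
```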


What is the difference between cvParam and userParam? A userParam probably can't be interpreted automatically, as it can contain any type of value without a reference to a controlled vocabulary. Do we need and/or want this? Which information do we need to store in the qcML file that doesn't have a corresponding term in our CV?
Also, if we need a flexible parameter, is this the best way to do it, or is a cvParam referencing a very general CV term an alternative?
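The interpretability concern can be sketched in a few lines of Python. This is a toy model under my own assumptions: parameters are plain dictionaries, `CV_TERMS` stands in for the controlled vocabulary, and the accession and names are placeholders.

```python
# Toy vocabulary; the accession and term name are placeholders.
CV_TERMS = {"PLACEHOLDER:0001": "number of spectra"}

def interpret(param):
    """Return the CV term name for a cvParam, or None if it can't be resolved."""
    if param["kind"] == "cvParam":
        return CV_TERMS.get(param["accession"])
    # A userParam has no CV reference, so software has nothing to look up.
    return None

cv_param = {"kind": "cvParam", "accession": "PLACEHOLDER:0001", "value": "1234"}
user_param = {"kind": "userParam", "name": "my custom count", "value": "1234"}

cv_meaning = interpret(cv_param)
user_meaning = interpret(user_param)
```

Only the cvParam resolves to a defined meaning; the userParam value is readable by a human but opaque to software.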


I think it's due to differing requirements between versions of the schema, but IdentifiableType could be used more broadly. It only has an ID attribute, while several other types declare a separate ID of their own. These could probably derive from IdentifiableType in some way as well. We should evaluate which information is needed where and possibly restructure the different types a bit (as multiple inheritance isn't possible) so that all relevant attributes are always present while keeping unnecessary complexity to a minimum.


## Remarks on hand-crafted example file


I think the qualityParameters with ID METRIC002_1 and METRIC002_3 are two examples of how to encode the same information, right? In that case I prefer the second option (METRIC002_3), but this can still be made more general. As we have discussed, we won't define directly whether data is divided into quartiles, but instead allow general quantiles, with the number of quantiles derived from the number of data points.
In that case it's also crucial to give all data points. So a quartile should always have 4 data points, even though one of these points might contain redundant information or might not exist (in which case a NaN should be used). For this example it should then be a quadruple instead of a triplet, with the final value 1. Of course this value carries hardly any useful information, but it removes potential ambiguities.
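A small Python sketch of this convention, assuming the quantile values are available as a plain list. The helper names `quantile_count` and `complete_tuple` are hypothetical, and the numbers are made up rather than taken from the example file.

```python
import math

def quantile_count(values):
    """The number of quantiles is simply the number of reported data points."""
    return len(values)

def complete_tuple(values, k):
    """Pad a list of quantile values with NaN until it has k entries."""
    if len(values) > k:
        raise ValueError("more data points than quantiles")
    return list(values) + [math.nan] * (k - len(values))

# Made-up values for illustration.
triplet = [0.2, 0.5, 0.8]
quadruple = triplet + [1.0]           # explicit final value 1
padded = complete_tuple(triplet, 4)   # last point unknown, stored as NaN
```

With the full quadruple a reader can infer quartiles from the length alone, whereas the bare triplet could be read either as incomplete quartiles or as complete tertiles.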


There are multiple accessions for the content elements: "n - tuple", "value at k-tile", "triplet", which are to some extent the same. I don't understand the difference between them and when one is used instead of the other. Why do we need these different options instead of a single "n-tuple" with its value specifying the number of data points?


Although it's not the main purpose of this example file, some housekeeping still needs to be done to bring it in line with the XML Schema, as several element names differ. For example, spectrumFileReference <-> RawFile, qualityParameter <-> qualityMetric, ...