@wojdyr
Created October 19, 2020 14:02
my thoughts about diffrn-data-set-extension
Dear All,
I read the proposal carefully. I'm involved in handling SF-mmCIF from
both sides: preparing files for deposition and adding support for
reflection mmCIF files in programs such as Aimless. So I took time to
think about the proposal, to check examples, check how unmerged data
is handled in imgCIF, how it is currently stored in the _diffrn_refln
category in the 328 PDB entries that use this category, how it is
stored in different formats that we will need to convert between (MTZ,
XDS ASCII) and, to get a wider perspective, over the last few months I
asked various people about data deposition.
Currently, the main blockers for depositing more useful data are
(1) that the software used in OneDep supports only part of the current
specification (missing essential bits) and
(1a) it's not documented what exactly is supported,
(1b) it can't be easily checked by trial and error (but I'm aware
that the plan is to move sf_convert to a public repository and
then this will be possible),
(2) the unmerged data description in the current spec is also missing
important things.
The proposal is a complete overhaul of the SF-mmCIF files. It improves
on (2), but it adds a lot of complexity that will slow down (1) and
also hamper using SF-mmCIF by other programs.
Overall, adopting the proposal would delay the deposition of (more
meaningful) unmerged data by months or years.
I appreciate that writing the proposal took a great deal of effort. In
every such project the knowledge gained in the process of writing is
more important than the written text. In my opinion, to make the
deposition of unmerged data widespread in a reasonable time, we should
take the knowledge but drop the proposal and instead focus on
gradual improvements to the current specification.
The best thing in the proposal is that it adds annotations on the
image level (currently, properties such as the wavelength or phi angle
are linked to individual reflections, which is not ideal). But the
same could be done by adding a tag such as _diffrn_refln.frame_id to
the current spec -- that's a tag from imgCIF.
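To illustrate the idea, here is a minimal sketch of how a frame id on each reflection could replace per-reflection copies of image-level properties. Only the tag name _diffrn_refln.frame_id comes from the text above; the per-frame field names, the reflection rows, and all values are hypothetical.

```python
# Hypothetical per-frame table: one row per image, keyed by frame id.
# Field names and values are made up for illustration.
frames = {
    "f1": {"wavelength": 0.9795, "phi_start": 0.0},
    "f2": {"wavelength": 0.9795, "phi_start": 0.5},
}

# Hypothetical unmerged reflections. Each row carries only a frame_id
# (standing in for _diffrn_refln.frame_id) instead of repeating the
# wavelength and angle for every observation.
reflections = [
    {"hkl": (1, 0, 0), "intensity_net": 120.0, "frame_id": "f1"},
    {"hkl": (1, 0, 0), "intensity_net": 118.0, "frame_id": "f2"},
    {"hkl": (0, 2, 1), "intensity_net": 45.5, "frame_id": "f2"},
]


def with_frame_properties(refl, frames):
    """Return a copy of a reflection row with its frame's properties attached."""
    return {**refl, **frames[refl["frame_id"]]}


expanded = [with_frame_properties(r, frames) for r in reflections]
```

The image-level values are stored once per frame and can still be recovered for any reflection with a single lookup.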
From what I understand, the main intended benefit of the proposal is
what was called "containerization" of the data. Each block is
explicitly marked as
type_merged='true'/'false'
and
type_scaled='true'/'false'
and the correspondence between merged and unmerged data is recorded.
The distinction between merged and unmerged data is already clear
because a different category is used for each.
The scaled/unscaled clarification is indeed
missing in the current spec. Again, a simpler solution could be used:
document _diffrn_refln.intensity_net as scaled (which is how it is
used in most of the PDB entries) and, if needed, add a new tag such as
_diffrn_refln.unscaled_intensity.
The correspondence between datasets should be more explicit,
but this also can be done in a backward-compatible way.
The most important thing to ensure data consistency would be validation
(software again) that checks if the unmerged data corresponds to the
merged one.
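As a rough sketch of what such a check might look like: re-merge the unmerged observations and compare the result with the deposited merged intensities. This is not how any existing program does it. It assumes that the indices are already mapped to the same asymmetric unit, that all sigmas are positive, and that a plain inverse-variance weighted mean is a good enough stand-in for real merging (which also involves outlier rejection and error-model corrections). The function names and the tolerance are hypothetical.

```python
from collections import defaultdict


def merge_unmerged(observations):
    """Inverse-variance weighted mean intensity per hkl.

    observations: iterable of (hkl, intensity, sigma) with sigma > 0.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # hkl -> [sum(w*I), sum(w)]
    for hkl, intensity, sigma in observations:
        weight = 1.0 / (sigma * sigma)
        sums[hkl][0] += weight * intensity
        sums[hkl][1] += weight
    return {hkl: swi / sw for hkl, (swi, sw) in sums.items()}


def find_mismatches(unmerged, merged, rel_tol=0.05):
    """List hkl from `merged` that the unmerged data do not reproduce.

    merged: dict mapping hkl -> merged intensity.
    rel_tol is an arbitrary illustrative threshold, not a recommendation.
    """
    recomputed = merge_unmerged(unmerged)
    mismatches = []
    for hkl, value in merged.items():
        if hkl not in recomputed:
            mismatches.append(hkl)  # no unmerged observation at all
            continue
        expected = recomputed[hkl]
        if abs(value - expected) > rel_tol * max(abs(expected), 1.0):
            mismatches.append(hkl)
    return mismatches


# Made-up data: the second merged value does not match the unmerged one.
unmerged = [
    ((1, 0, 0), 100.0, 2.0),
    ((1, 0, 0), 104.0, 2.0),
    ((0, 2, 1), 50.0, 1.0),
]
merged = {(1, 0, 0): 102.0, (0, 2, 1): 80.0}
mismatches = find_mismatches(unmerged, merged)
```

A real validator would more likely report overall statistics (for example, a correlation coefficient) than a per-reflection list, but the principle is the same.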
Another change that the proposal introduces is making tags more
descriptive. Reflection tables _refln and _diffrn_refln are renamed to
_pdbx_diffrn_merged_refln and _pdbx_diffrn_unmerged_refln.
I appreciate informative names, but I don't think that their benefit
outweighs backward compatibility. (An extra bonus of the current
naming: it's similar to what is used for small molecules.)
I'm trying to keep this email short, so I focus only on the good parts
of the proposal. There are also questionable things. Some points have
been raised in the Issues section of the proposal, and at every meeting
someone reminds us that these points are still waiting to be addressed.
But what I'm arguing for is changing the approach. Instead of the
long-discussed waterfall change, make smaller, iterative improvements
to the current specification and adapt the OneDep software at the
same time. First, remove the roadblocks. This way we will start getting
unmerged depositions quickly, developers will start to use them for
validation and for method development, and we will be able to make
better-informed decisions about next changes.
The first iteration could look like this:
0a) formally indicate which categories in the spec are for reflection
files - by moving them into a separate DDL file,
0b) make a list of categories/tags that are never used in the PDB
archive and are not supported by the software, ask the WG which
can be useful, remove the rest.
1) add a single new tag for "centroid of image numbers that recorded
the Bragg peak" (as XDS docs call it).
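Step 0b could start from a simple usage count. The sketch below is only an illustration: it finds tags with a line-based regular expression, which can misfire on multi-line text fields, so a real survey should use a proper mmCIF parser. The function names and the example tag list are hypothetical.

```python
import re
from collections import Counter

# Matches an mmCIF-style tag (_category.item) at the start of a line.
TAG_RE = re.compile(r"^[ \t]*(_[A-Za-z0-9_\[\]]+\.[A-Za-z0-9_\[\]\-]+)",
                    re.MULTILINE)


def tags_in_text(cif_text):
    """Return the set of (lower-cased) tags that appear in one file's text."""
    return {tag.lower() for tag in TAG_RE.findall(cif_text)}


def unused_tags(spec_tags, cif_texts):
    """Return the tags from spec_tags that occur in none of the given files."""
    usage = Counter()
    for text in cif_texts:
        usage.update(tags_in_text(text))
    return sorted(tag for tag in spec_tags if usage[tag.lower()] == 0)


# Made-up inputs for illustration.
spec = ["_diffrn_refln.intensity_net", "_diffrn_refln.frame_id",
        "_refln.index_h"]
files = [
    "data_x\nloop_\n_refln.index_h\n_refln.index_k\n1 2\n",
    "data_y\n_diffrn_refln.intensity_net 5.0\n",
]
never_used = unused_tags(spec, files)
```

The resulting list is only a starting point for the discussion in the WG; a tag that is absent from the archive may still be worth keeping.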
Kind regards,
Marcin