wojdyr/regarding-sf-mmcif-extension.txt

## regarding-sf-mmcif-extension.txt
Dear All,

I have read the proposal carefully. I'm involved in handling SF-mmCIF from
both sides: preparing files for deposition and adding support for
reflection mmCIF files in programs such as Aimless. So I took time to
think about the proposal, to check examples, check how unmerged data
is handled in imgCIF, how it is currently stored in the _diffrn_refln
category in the 328 PDB entries that use this category, how it is
stored in different formats that we will need to convert between (MTZ,
XDS ASCII) and, to get a wider perspective, over the last few months I
have asked various people questions about data deposition.

Currently, the main blockers for depositing more useful data are:
(1) the software used in OneDep supports only part of the current
    specification (essential bits are missing), and
  (1a) it's not documented what exactly is supported,
  (1b) it can't be easily checked by trial and error (but I'm aware
       that the plan is to move sf_convert to a public repository and
       then this will be possible),
(2) the unmerged data description in the current spec is also missing
    important things.

The proposal is a complete overhaul of the SF-mmCIF files. It improves
on (2), but it adds a lot of complexity that will slow down fixing (1)
and also hamper the use of SF-mmCIF by other programs.
Overall, adopting the proposal would delay the deposition of (more
meaningful) unmerged data by months or years.

I appreciate that writing the proposal took a great deal of effort. In
every such project the knowledge gained in the process of writing is
more important than the written text. In my opinion, to make the
deposition of unmerged data widespread in a reasonable time, we should
take the knowledge but drop the proposal, and instead focus on
gradual improvements to the current specification.

The best thing in the proposal is that it adds annotations on the
image level (currently, properties such as the wavelength or phi angle
are linked to individual reflections, which is not ideal). But the
same could be done by adding a tag such as _diffrn_refln.frame_id to
the current spec -- that's a tag from imgCIF.
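
To illustrate what I mean, here is a minimal Python sketch (not taken
from any existing program) of how per-reflection values could be
factored out into a per-frame table keyed by a frame id. The column
names "wavelength" and "phi" and the helper split_by_frame are
hypothetical and chosen only for this example.

```python
def split_by_frame(reflections):
    """Move per-frame properties out of the reflection rows.

    reflections: list of dicts with the (hypothetical) keys
    h, k, l, intensity, frame_id, wavelength, phi.
    Returns (frames, slim_reflections), where frames maps
    frame_id -> (wavelength, phi).
    """
    frames = {}
    slim = []
    for r in reflections:
        props = (r["wavelength"], r["phi"])
        # All reflections on one frame must agree on the frame properties.
        if frames.setdefault(r["frame_id"], props) != props:
            raise ValueError(f"inconsistent properties for frame {r['frame_id']}")
        # The reflection row keeps only a reference to its frame.
        slim.append({k: r[k] for k in ("h", "k", "l", "intensity", "frame_id")})
    return frames, slim
```

Each reflection then carries only a frame reference, and the
wavelength and angle are stored once per image.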

From what I understand, the main intended benefit of the proposal is
what was called "containerization" of the data. Each block is
explicitly marked as
  type_merged='true'/'false'
and
  type_scaled='true'/'false'
and the correspondence between merged and unmerged data is recorded.

The distinction between merged and unmerged data is already clear
because each is described by a different category.

The scaled/unscaled distinction is indeed missing from the current
spec. Again, a simpler solution could be used:
document _diffrn_refln.intensity_net as scaled (which is how it is
used in most of the PDB entries) and, if needed, add a new tag such as
_diffrn_refln.unscaled_intensity.

The correspondence between datasets should be more explicit,
but this also can be done in a backward-compatible way.
The most important thing to ensure data consistency would be validation
(software again) that checks if the unmerged data corresponds to the
merged one.
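
A rough Python sketch of such a check is below. It is only an
illustration under strong assumptions: the indices are taken to be
already reduced to the asymmetric unit, merging is a plain
inverse-variance weighted mean with no outlier rejection, and the 5%
tolerance is arbitrary. The function names are hypothetical and a real
validator would need to follow what the scaling programs actually do.

```python
from collections import defaultdict

def merge_unmerged(observations):
    """Inverse-variance weighted mean intensity per (h, k, l).

    observations: iterable of (h, k, l, intensity, sigma); the indices
    are assumed to be already reduced to the asymmetric unit.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # hkl -> [sum(w*I), sum(w)]
    for h, k, l, intensity, sigma in observations:
        w = 1.0 / (sigma * sigma)
        acc = sums[(h, k, l)]
        acc[0] += w * intensity
        acc[1] += w
    return {hkl: wi / w for hkl, (wi, w) in sums.items()}

def find_mismatches(merged, observations, rel_tol=0.05):
    """List hkl whose deposited merged intensity disagrees with the
    re-merged value, or which are absent from the unmerged data."""
    remerged = merge_unmerged(observations)
    bad = []
    for hkl, deposited in merged.items():
        value = remerged.get(hkl)
        limit = rel_tol * max(abs(deposited), 1.0)
        if value is None or abs(value - deposited) > limit:
            bad.append(hkl)
    return bad
```

An empty list from find_mismatches would mean that, within the chosen
tolerance, the merged block is consistent with the unmerged one.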

Another change that the proposal introduces is making tags more
descriptive. Reflection tables _refln and _diffrn_refln are renamed to
_pdbx_diffrn_merged_refln and _pdbx_diffrn_unmerged_refln.
I appreciate informative names, but I don't think that their benefit
outweighs backward compatibility. (An extra bonus of the current
naming: it's similar to what is used for small molecules.)

I'm trying to keep this email short, so I focus only on the good parts
of the proposal. There are also questionable things. Some points have
been raised in the Issues section of the proposal, and at every meeting
someone reminds us that these points are still waiting to be addressed.
But what I'm arguing for is changing the approach. Instead of the
long-discussed waterfall change, make smaller, iterative improvements
to the current specification and adapt the OneDep software at the
same time. First, remove the roadblocks. This way we will start getting
unmerged depositions quickly, developers will start to use them for
validation and for method development, and we will be able to make
better-informed decisions about the next changes.

The first iteration could look like this:
0a) formally indicate which categories in the spec are for reflection
    files - by moving them into a separate DDL file,
0b) make a list of categories/tags that are never used in the PDB
    archive and are not supported by the software, ask the WG which
    can be useful, remove the rest.
1) add a single new tag for "centroid of image numbers that recorded
   the Bragg peak" (as XDS docs call it).
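
For item 1, one possible reading of that quantity is a signal-weighted
mean of the image numbers over which the peak was recorded. The Python
sketch below shows only this reading; it is an assumption for
illustration, not a description of how XDS computes its value, and the
function name image_centroid is hypothetical.

```python
def image_centroid(counts_per_image):
    """Signal-weighted mean image number for one Bragg peak.

    counts_per_image: dict mapping image number -> background-subtracted
    counts that the peak contributed to that image.
    """
    total = sum(counts_per_image.values())
    if total <= 0:
        raise ValueError("peak has no positive signal")
    return sum(n * c for n, c in counts_per_image.items()) / total
```

A peak with a quarter of its signal on image 10 and the rest on
image 11 would get a fractional value between the two, closer to 11.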

Kind regards,
Marcin