This is the full description of the proposal referred to by GitHub issue #6 in the PDBX/mmCIF working group repository on the PDBx mmCIF dictionary extension for diffraction data sets.
Below is an example of the definition of the processing of a complex data collection, based on PDB entry 4MU9. It uses (with some modifications) the proposed PDBx mmCIF dictionary extension for diffraction data sets. The image data are available from the JCSG. The data collection was a 3-wavelength MAD experiment made up of 5 sweeps. Going by the image timestamps the first two sweeps were wavelength-interleaved with a wedge size of 15° (=60 images of 0.25°). For illustrative purposes I show an interleaving scheme that minimises the number of wavelength changes, leading to most of the scans being 30°, although these data may not have been collected in exactly that way in reality. The remaining three sweeps were each collected in a single scan at the third wavelength.
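The illustrative interleaving scheme can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function name `interleaved_scans` and its parameters are hypothetical, and the values (720 images per sweep, a 60-image wedge, 0.25° per image, a start angle of 323°) are taken from the scan tables further down this page. The first sweep opens with a single wedge, after which each wavelength is held for two wedges.

```python
def interleaved_scans(n_images, wedge_images, osc, start_angle):
    """Split two interleaved sweeps into scans, minimising wavelength changes.

    Sweep E1 opens with a single wedge; after that each wavelength is held
    for two wedges, so most scans are 2 * wedge_images long.
    Returns {sweep: [(image_begin, image_end, angle_begin, angle_end), ...]}.
    """
    scans = {"E1": [], "E2": []}
    for name, first_length in (("E1", wedge_images), ("E2", 2 * wedge_images)):
        begin = 1
        length = first_length
        while begin <= n_images:
            end = min(begin + length - 1, n_images)
            angle_begin = start_angle + (begin - 1) * osc
            angle_end = start_angle + end * osc
            scans[name].append((begin, end, angle_begin, angle_end))
            begin = end + 1
            length = 2 * wedge_images  # every later scan spans two wedges
    return scans


scans = interleaved_scans(n_images=720, wedge_images=60, osc=0.25, start_angle=323.0)
```

With these inputs the first sweep is split into seven scans and the second into six, matching the scan definitions given for the two interleaved data sections.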
This example does not contain any processing results (statistics, cell parameters etc.) or reflection data, since the main issue here is the definition of the data sections, and of their relationships to each other and to the data collection. I understand the true meaning of a data section to be a step in a data processing path that produces data for structure solution and refinement from diffraction images. This means that the data sections in any particular case are determined by the processing application or workflow. The contents of a data section, as specified by the pdbx_diffrn_data_section_contents category, are the output of that processing step. The data sections below reflect the steps taken by autoPROC when processing the images from the JCSG; another application may do things differently. (I find the description of the pdbx_diffrn_data_section category rather tautological: as we work on this it would be useful to try to come up with a new description for this fundamental category.)
In summary, the suggestions for changes arising directly from this example are:
- Change the name of the pdbx_diffrn_merge_image_list category so that it does not contain the word 'merge', perhaps to pdbx_diffrn_data_section_image_list. The reason for this is that a selection of a subset of images from a set of scans might be done in a processing step that does not do any merging, if the images are detected as being unusable. (Reasons for images being unusable include: too few spots, completely blank, missing from the filesystem, instrument settings wrong.) Also for this category:
  - Change _item.mandatory_code of the data_section_id item to 'implicit'
  - Replace the crystal_id item with a scan_id item
- Change _item.mandatory_code of _pdbx_diffrn_data_section_index.data_section_id to 'implicit'
- The item _pdbx_diffrn_merge_wavelength_list.id should be understood to have global scope (defined by a mechanism to be decided on).
- Introduce a new data item _pdbx_diffrn_scan.wavelength_id.
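To make the 'implicit' suggestion concrete, here is a minimal Python sketch of how a reader might fill in an omitted data_section_id. It assumes that 'implicit' would mean "defaults to the id of the data section described by the enclosing data block", which is my reading of the suggestion rather than anything the dictionary currently states. The parsed-block layout (a dict of category name to list of row dicts) and the function name `fill_implicit_section_id` are hypothetical.

```python
def fill_implicit_section_id(block):
    """Give every row of the listed categories the block's own data section id.

    `block` is a hypothetical parsed data block: {category: [row_dict, ...]}.
    Rows that already carry an explicit data_section_id are left unchanged.
    """
    section_id = block["pdbx_diffrn_data_section"][0]["id"]
    for category in ("pdbx_diffrn_merge_image_list", "pdbx_diffrn_data_section_index"):
        for row in block.get(category, []):
            row.setdefault("data_section_id", section_id)
    return block


block = {
    "pdbx_diffrn_data_section": [{"id": "151759_2"}],
    "pdbx_diffrn_merge_image_list": [
        {"scan_id": "2_01", "image_id_begin": 1, "image_id_end": 443},
        {"scan_id": "2_01", "image_id_begin": 492, "image_id_end": 720},
    ],
}
fill_implicit_section_id(block)
```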
In this example, I have used the PDB id 4mu9
as part of various identifiers and names. Obviously, a data file that is generated by a data processing application prior to deposition will use a more arbitrary identifier.
This datablock needs no changes from the one in the current example from the mmCIF WG repository.
data_DIFFRN-PDB_000014mu9
_entry.id pdb_000014mu9
loop_
_pdbx_diffrn_data_section_correspondence.diffrn_id
_pdbx_diffrn_data_section_correspondence.data_section_id
1 '4mu9_merged'
#
loop_
_pdbx_diffrn_data_section_contents.data_section_id
_pdbx_diffrn_data_section_contents.content_type
4mu9_merged 'X-ray structure factor amplitudes'
4mu9_merged 'X-ray calculated amplitudes'
4mu9_merged 'X-ray calculated phases'
151759_1_E1 'X-ray unmerged intensities'
151759_1_E2 'X-ray unmerged intensities'
151759_2 'X-ray unmerged intensities'
151759_3 'X-ray unmerged intensities'
151759_4 'X-ray unmerged intensities'
# ... audit details omitted ...
This datablock specifies the result of the merging step, as well as the wavelengths that were used in the data collection.
data_4mu9_merged
_pdbx_diffrn_data_section.id '4mu9_merged'
_pdbx_diffrn_data_section.type_scattering 'x-ray'
_pdbx_diffrn_data_section.type_merged 'true'
_pdbx_diffrn_data_section.type_scaled 'true'
# _pdbx_diffrn_data_section_index.data_section_id not needed here if specified to be implicit
loop_
_pdbx_diffrn_data_section_index.parent_data_section_id
151759_1_E1
151759_1_E2
151759_2
151759_3
151759_4
loop_
_pdbx_diffrn_merge_wavelength_list.id # Has global scope: see scan definitions below
_pdbx_diffrn_merge_wavelength_list.wavelength
1 0.979470
2 0.918370
3 0.978760
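The parent relationships recorded in pdbx_diffrn_data_section_index define a processing path that can be walked back from the merged data section to its inputs. The following Python sketch shows one way to do that; the `parents` mapping is typed in by hand from the datablock above, and the function name `processing_inputs` is hypothetical.

```python
# Parent data sections of each data section, copied from the datablock above.
parents = {
    "4mu9_merged": ["151759_1_E1", "151759_1_E2", "151759_2", "151759_3", "151759_4"],
}


def processing_inputs(section_id, parents):
    """Return every data section upstream of section_id, depth first."""
    found = []
    for parent in parents.get(section_id, []):
        found.append(parent)
        found.extend(processing_inputs(parent, parents))
    return found


merged_inputs = processing_inputs("4mu9_merged", parents)
```

In this example the path is only one level deep, but the same walk would apply if a workflow inserted intermediate data sections between the unmerged and merged ones.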
The names of the unmerged data sections are derived from the image filename prefixes, and are generated by autoPROC in this case.
The first of the two interleaved data sections is defined as follows:
data_151759_1_E1
_pdbx_diffrn_data_section.id '151759_1_E1'
_pdbx_diffrn_data_section.type_scattering 'x-ray'
_pdbx_diffrn_data_section.type_merged 'false'
_pdbx_diffrn_data_section.type_scaled 'true'
# This category may be better named 'pdbx_diffrn_data_section_image_list'
# _pdbx_diffrn_merge_image_list.data_section_id # _item.mandatory_code should be 'implicit'
loop_
_pdbx_diffrn_merge_image_list.scan_id # substitute for crystal id
_pdbx_diffrn_merge_image_list.image_id_begin
_pdbx_diffrn_merge_image_list.image_id_end
# The first and last images from this set of scans were not processed
1_E1_01 2 60 # Image 1 unusable
1_E1_02 61 180
1_E1_03 181 300
1_E1_04 301 420
1_E1_05 421 540
1_E1_06 541 660
1_E1_07 661 719 # Image 720 unusable
loop_
_pdbx_diffrn_scan.scan_id
_pdbx_diffrn_scan.crystal_id
_pdbx_diffrn_scan.image_id_begin
_pdbx_diffrn_scan.image_id_end
_pdbx_diffrn_scan.scan_angle_begin
_pdbx_diffrn_scan.scan_angle_end
_pdbx_diffrn_scan.wavelength_id
# Wavelength-interleaved scans
1_E1_01 1 1 60 323. 338. 1
1_E1_02 1 61 180 338. 368. 1
1_E1_03 1 181 300 368. 398. 1
1_E1_04 1 301 420 398. 428. 1
1_E1_05 1 421 540 428. 458. 1
1_E1_06 1 541 660 458. 488. 1
1_E1_07 1 661 720 488. 503. 1
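As a sanity check on scan definitions like these, the image numbering and angular ranges can be tested for internal consistency. The sketch below is illustrative only: the tuples are transcribed by hand from the pdbx_diffrn_scan loop above, the 0.25° image width comes from the description of the data collection, and `check_scans` is a hypothetical helper.

```python
scans_e1 = [
    # (scan_id, image_id_begin, image_id_end, scan_angle_begin, scan_angle_end)
    ("1_E1_01", 1, 60, 323.0, 338.0),
    ("1_E1_02", 61, 180, 338.0, 368.0),
    ("1_E1_03", 181, 300, 368.0, 398.0),
    ("1_E1_04", 301, 420, 398.0, 428.0),
    ("1_E1_05", 421, 540, 428.0, 458.0),
    ("1_E1_06", 541, 660, 458.0, 488.0),
    ("1_E1_07", 661, 720, 488.0, 503.0),
]


def check_scans(scans, osc):
    """Return a list of problems found in a sequence of scan definitions."""
    problems = []
    for i, (scan_id, img_begin, img_end, ang_begin, ang_end) in enumerate(scans):
        n_images = img_end - img_begin + 1
        # The angular range of a scan should equal image count * image width.
        if abs((ang_end - ang_begin) - n_images * osc) > 1e-6:
            problems.append(f"{scan_id}: angular range does not match image count")
        # Image numbers should continue from the end of the previous scan.
        if i > 0 and img_begin != scans[i - 1][2] + 1:
            problems.append(f"{scan_id}: image numbering is not contiguous")
    return problems


problems_e1 = check_scans(scans_e1, osc=0.25)
```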
The second interleaved data section is very similar, differing in the wavelength and the details of the scans:
data_151759_1_E2
_pdbx_diffrn_data_section.id '151759_1_E2'
_pdbx_diffrn_data_section.type_scattering 'x-ray'
_pdbx_diffrn_data_section.type_merged 'false'
_pdbx_diffrn_data_section.type_scaled 'true'
# This category may be better named 'pdbx_diffrn_data_section_image_list'
# _pdbx_diffrn_merge_image_list.data_section_id # _item.mandatory_code should be 'implicit'
loop_
_pdbx_diffrn_merge_image_list.scan_id # substitute for crystal id
_pdbx_diffrn_merge_image_list.image_id_begin
_pdbx_diffrn_merge_image_list.image_id_end
# The first image from this set of scans was not processed
1_E2_01 2 120 # Image 1 unusable
1_E2_02 121 240
1_E2_03 241 360
1_E2_04 361 480
1_E2_05 481 600
1_E2_06 601 720
loop_
_pdbx_diffrn_scan.scan_id
_pdbx_diffrn_scan.crystal_id
_pdbx_diffrn_scan.image_id_begin
_pdbx_diffrn_scan.image_id_end
_pdbx_diffrn_scan.scan_angle_begin
_pdbx_diffrn_scan.scan_angle_end
_pdbx_diffrn_scan.wavelength_id
# Wavelength-interleaved scans
1_E2_01 1 1 120 323. 353. 2
1_E2_02 1 121 240 353. 383. 2
1_E2_03 1 241 360 383. 413. 2
1_E2_04 1 361 480 413. 443. 2
1_E2_05 1 481 600 443. 473. 2
1_E2_06 1 601 720 473. 503. 2
The data sections for the remaining three sweeps are much simpler than the interleaved ones, so only one of the three is shown. This example illustrates how a chunk of unusable images from the middle of the scan is represented.
data_151759_2
_pdbx_diffrn_data_section.id '151759_2'
_pdbx_diffrn_data_section.type_scattering 'x-ray'
_pdbx_diffrn_data_section.type_merged 'false'
_pdbx_diffrn_data_section.type_scaled 'true'
# This category may be better named 'pdbx_diffrn_data_section_image_list'
# _pdbx_diffrn_merge_image_list.data_section_id # _item.mandatory_code should be 'implicit'
loop_
_pdbx_diffrn_merge_image_list.scan_id # substitute for crystal id
_pdbx_diffrn_merge_image_list.image_id_begin
_pdbx_diffrn_merge_image_list.image_id_end
2_01 1 443
# Images 444-491 inclusive are unusable and have been excluded from processing
2_01 492 720
_pdbx_diffrn_scan.scan_id 2_01
_pdbx_diffrn_scan.crystal_id 1
_pdbx_diffrn_scan.image_id_begin 1
_pdbx_diffrn_scan.image_id_end 720
_pdbx_diffrn_scan.scan_angle_begin 323.
_pdbx_diffrn_scan.scan_angle_end 503.
_pdbx_diffrn_scan.wavelength_id 3
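Because the image list gives only the ranges that were used, the excluded images have to be inferred by comparison with the scan definition. A minimal Python sketch of that comparison follows; `excluded_images` is a hypothetical helper and the ranges are copied from the datablocks on this page.

```python
def excluded_images(scan_begin, scan_end, used_ranges):
    """Return the inclusive image ranges of a scan not covered by used_ranges."""
    gaps = []
    next_expected = scan_begin
    for begin, end in sorted(used_ranges):
        if begin > next_expected:
            gaps.append((next_expected, begin - 1))
        next_expected = max(next_expected, end + 1)
    if next_expected <= scan_end:
        gaps.append((next_expected, scan_end))
    return gaps


# Scan 2_01 covers images 1-720; the image list uses 1-443 and 492-720.
gaps_2_01 = excluded_images(1, 720, [(1, 443), (492, 720)])

# Scan 1_E1_01 covers images 1-60; the image list uses 2-60.
gaps_e1_first = excluded_images(1, 60, [(2, 60)])
```

The same comparison works for the single unusable images at the start and end of the interleaved sweeps, which is one argument for linking the image list to scan_id rather than crystal_id.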
N.B. Please make any comments on this proposal at the linked GitHub issue