Using the PDBx/mmCIF data section extension to represent a complex data collection.

Introduction

This is the full description of the proposal referred to by GitHub issue #6 in the PDBx/mmCIF working group repository on the PDBx/mmCIF dictionary extension for diffraction data sets.

Below is an example of the definition of the processing of a complex data collection, based on PDB entry 4MU9. It uses (with some modifications) the proposed PDBx/mmCIF dictionary extension for diffraction data sets. The image data are available from the JCSG. The data collection was a 3-wavelength MAD experiment made up of 5 sweeps. Going by the image timestamps, the first two sweeps were wavelength-interleaved with a wedge size of 15° (= 60 images of 0.25°). For illustrative purposes I show an interleaving scheme that minimises the number of wavelength changes, so that most of the scans are 30° wide, although the data may not have been collected in exactly that way. The remaining three sweeps were each collected in a single scan at the third wavelength.
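
The interleaving scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name and its parameters are hypothetical, and the numbers (720 images of 0.25° starting at 323°, with a 15° first wedge for the first sweep) are read off the scan tables in the example below, not from any instrument log.

```python
def interleaved_scans(n_images, first_wedge, wedge, start_angle, osc):
    """Split one sweep into scans (hypothetical helper).

    first_wedge and wedge are scan sizes in images; osc is the rotation
    per image in degrees. Returns a list of
    (image_begin, image_end, angle_begin, angle_end) tuples.
    """
    scans = []
    begin = 1
    size = first_wedge
    while begin <= n_images:
        end = min(begin + size - 1, n_images)
        scans.append((begin, end,
                      start_angle + (begin - 1) * osc,
                      start_angle + end * osc))
        begin = end + 1
        size = wedge  # every scan after the first uses the full wedge
    return scans


# First sweep: a 15° scan (60 images), then 30° scans (120 images).
e1 = interleaved_scans(720, first_wedge=60, wedge=120,
                       start_angle=323.0, osc=0.25)
# Second sweep: 30° scans throughout.
e2 = interleaved_scans(720, first_wedge=120, wedge=120,
                       start_angle=323.0, osc=0.25)

for scan in e1:
    print(scan)
```

Offsetting the first sweep by half a wedge is what lets the two wavelengths alternate in 30° scans while each still covers 323° to 503°.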

This example does not contain any processing results (statistics, cell parameters etc.) or reflection data, since the main issue here is the definition of the data sections, and of their relationships to each other and to the data collection. I understand a data section to be a step in a data processing path that produces data for structure solution and refinement from diffraction images. This means that the data sections in any particular case are determined by the processing application or workflow. The contents of a data section, as specified by the pdbx_diffrn_data_section_contents category, are the output of that processing step. The data sections below reflect the steps taken by autoPROC when processing the images from the JCSG; another application may do things differently. (I find the description of the pdbx_diffrn_data_section category rather tautological: as we work on this it would be useful to try to come up with a new description for this fundamental category.)

Proposed changes to the mmCIF dictionary

In summary, the suggestions for changes arising directly from this example are:

  • Change the name of the pdbx_diffrn_merge_image_list category so that it does not contain the word 'merge', perhaps to pdbx_diffrn_data_section_image_list. The reason for this is that a selection of a subset of images from a set of scans might be done in a processing step that does not do any merging, if the images are detected as being unusable. (Reasons for images being unusable include: too few spots, completely blank, missing from the filesystem, instrument settings wrong.) Also for this category:
    • Change _item.mandatory_code of the data_section_id item to 'implicit'
    • Replace the crystal_id item with a scan_id item
  • Change _item.mandatory_code of _pdbx_diffrn_data_section_index.data_section_id to 'implicit'
  • The item _pdbx_diffrn_merge_wavelength_list.id should be understood to have global scope (defined by a mechanism to be decided on).
  • Introduce a new data item _pdbx_diffrn_scan.wavelength_id.
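
As a rough illustration of what an 'implicit' data_section_id could mean for software reading such a file, the sketch below fills in the omitted value from the enclosing data block. The helper name and the dict-per-row representation are assumptions made for this example; the actual resolution rule would be whatever the dictionary ends up specifying.

```python
def fill_implicit_section_id(block_section_id, rows):
    """Return new rows in which a missing data_section_id is taken from the
    enclosing data block; an explicit value in a row is left untouched.
    (Hypothetical helper, not part of any existing library.)"""
    return [{"data_section_id": block_section_id, **row} for row in rows]


# Image list rows as they appear in data block 151759_2, without the id.
image_list = [
    {"scan_id": "2_01", "image_id_begin": 1, "image_id_end": 443},
    {"scan_id": "2_01", "image_id_begin": 492, "image_id_end": 720},
]

filled = fill_implicit_section_id("151759_2", image_list)
for row in filled:
    print(row)
```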

The proposed mmCIF datablock by datablock

In this example, I have used the PDB id 4mu9 as part of various identifiers and names. Obviously, a data file that is generated by a data processing application prior to deposition will use a more arbitrary identifier.

The top-level "ToC" block

This datablock needs no changes from the one in the current example from the mmCIF WG repository.

data_DIFFRN-PDB_000014mu9

_entry.id pdb_000014mu9

loop_
_pdbx_diffrn_data_section_correspondence.diffrn_id
_pdbx_diffrn_data_section_correspondence.data_section_id
1 '4mu9_merged'

#
loop_
_pdbx_diffrn_data_section_contents.data_section_id
_pdbx_diffrn_data_section_contents.content_type

4mu9_merged     'X-ray structure factor amplitudes'
4mu9_merged     'X-ray calculated amplitudes'
4mu9_merged     'X-ray calculated phases'
151759_1_E1     'X-ray unmerged intensities'
151759_1_E2     'X-ray unmerged intensities'
151759_2        'X-ray unmerged intensities'
151759_3        'X-ray unmerged intensities'
151759_4        'X-ray unmerged intensities'

# ... audit details omitted ...

The merged data section

This datablock specifies the result of the merging step, as well as the wavelengths that were used in the data collection.

data_4mu9_merged

 _pdbx_diffrn_data_section.id              '4mu9_merged'
 _pdbx_diffrn_data_section.type_scattering 'x-ray'
 _pdbx_diffrn_data_section.type_merged     'true'
 _pdbx_diffrn_data_section.type_scaled     'true'


# _pdbx_diffrn_data_section_index.data_section_id not needed here if specified to be implicit

loop_
_pdbx_diffrn_data_section_index.parent_data_section_id
 151759_1_E1
 151759_1_E2
 151759_2   
 151759_3   
 151759_4   
 
loop_
_pdbx_diffrn_merge_wavelength_list.id           # Has global scope: see scan definitions below
_pdbx_diffrn_merge_wavelength_list.wavelength
  1   0.979470
  2   0.918370
  3   0.978760

The names of the unmerged data sections are derived from the image filename prefixes, and are generated by autoPROC in this case.
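
To illustrate the global scope of the wavelength ids, here is a minimal sketch of how a reader might resolve a scan's wavelength_id against the list held in the merged data block. The dictionary layout and the function name are assumptions for this example only; the mechanism that actually gives the id global scope is still to be decided.

```python
# Wavelength list copied from the merged data block (id -> wavelength).
WAVELENGTHS = {1: 0.979470, 2: 0.918370, 3: 0.978760}

# scan_id -> wavelength_id, one scan from each kind of unmerged block.
SCAN_WAVELENGTH_IDS = {"1_E1_01": 1, "1_E2_01": 2, "2_01": 3}


def scan_wavelength(scan_id):
    """Look up the wavelength for a scan via its wavelength_id.

    Raises KeyError if the scan or its wavelength id is unknown, which
    is how a dangling cross-block reference would show up here.
    """
    return WAVELENGTHS[SCAN_WAVELENGTH_IDS[scan_id]]


for scan_id in SCAN_WAVELENGTH_IDS:
    print(scan_id, scan_wavelength(scan_id))
```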

Unmerged data sections: wavelength interleaved

The first of the two interleaved data sections is defined as follows:

data_151759_1_E1

 _pdbx_diffrn_data_section.id              '151759_1_E1'
 _pdbx_diffrn_data_section.type_scattering 'x-ray'
 _pdbx_diffrn_data_section.type_merged     'false'
 _pdbx_diffrn_data_section.type_scaled     'true'


# This category may be better named 'pdbx_diffrn_data_section_image_list'
# _pdbx_diffrn_merge_image_list.data_section_id   # _item.mandatory_code should be 'implicit'

loop_
_pdbx_diffrn_merge_image_list.scan_id             # substitute for crystal id
_pdbx_diffrn_merge_image_list.image_id_begin
_pdbx_diffrn_merge_image_list.image_id_end
  # The first and last images from this set of scans were not processed
  1_E1_01     2    60                            # Image 1 unusable
  1_E1_02    61   180
  1_E1_03   181   300
  1_E1_04   301   420
  1_E1_05   421   540
  1_E1_06   541   660
  1_E1_07   661   719                            # Image 720 unusable
  
loop_
_pdbx_diffrn_scan.scan_id
_pdbx_diffrn_scan.crystal_id
_pdbx_diffrn_scan.image_id_begin
_pdbx_diffrn_scan.image_id_end
_pdbx_diffrn_scan.scan_angle_begin
_pdbx_diffrn_scan.scan_angle_end
_pdbx_diffrn_scan.wavelength_id

  # Wavelength-interleaved scans
  1_E1_01   1   1   60    323.    338.   1
  1_E1_02   1  61  180    338.    368.   1
  1_E1_03   1 181  300    368.    398.   1
  1_E1_04   1 301  420    398.    428.   1
  1_E1_05   1 421  540    428.    458.   1
  1_E1_06   1 541  660    458.    488.   1
  1_E1_07   1 661  720    488.    503.   1
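
The relationship between an image list row and its scan can be checked mechanically. The sketch below assumes that scan_angle_begin is the start of the first image and scan_angle_end is the end of the last image of the scan; that reading is consistent with the numbers above but is my assumption, and the function name is hypothetical.

```python
# scan_id -> (image_id_begin, image_id_end, angle_begin, angle_end)
SCANS = {
    "1_E1_01": (1, 60, 323.0, 338.0),
    "1_E1_07": (661, 720, 488.0, 503.0),
}


def processed_angles(scan_id, first, last):
    """Return the rotation range covered by images first..last of a scan.

    Raises ValueError if the image range is not contained in the scan.
    """
    img_begin, img_end, ang_begin, ang_end = SCANS[scan_id]
    if not (img_begin <= first <= last <= img_end):
        raise ValueError(f"images {first}-{last} lie outside scan {scan_id}")
    # Rotation per image, derived from the scan definition itself.
    width = (ang_end - ang_begin) / (img_end - img_begin + 1)
    return (ang_begin + (first - img_begin) * width,
            ang_begin + (last - img_begin + 1) * width)


print(processed_angles("1_E1_01", 2, 60))     # image 1 dropped
print(processed_angles("1_E1_07", 661, 719))  # image 720 dropped
```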

The second interleaved data section is very similar, differing in the wavelength and the details of the scans:

data_151759_1_E2

 _pdbx_diffrn_data_section.id              '151759_1_E2'
 _pdbx_diffrn_data_section.type_scattering 'x-ray'
 _pdbx_diffrn_data_section.type_merged     'false'
 _pdbx_diffrn_data_section.type_scaled     'true'


# This category may be better named 'pdbx_diffrn_data_section_image_list'
# _pdbx_diffrn_merge_image_list.data_section_id   # _item.mandatory_code should be 'implicit'

loop_
_pdbx_diffrn_merge_image_list.scan_id             # substitute for crystal id
_pdbx_diffrn_merge_image_list.image_id_begin
_pdbx_diffrn_merge_image_list.image_id_end
  # The first image from this set of scans was not processed
  1_E2_01     2   120                            # Image 1 unusable
  1_E2_02   121   240
  1_E2_03   241   360
  1_E2_04   361   480
  1_E2_05   481   600
  1_E2_06   601   720
  
loop_
_pdbx_diffrn_scan.scan_id
_pdbx_diffrn_scan.crystal_id
_pdbx_diffrn_scan.image_id_begin
_pdbx_diffrn_scan.image_id_end
_pdbx_diffrn_scan.scan_angle_begin
_pdbx_diffrn_scan.scan_angle_end
_pdbx_diffrn_scan.wavelength_id

  # Wavelength-interleaved scans
  1_E2_01   1   1  120    323.    353.   2
  1_E2_02   1 121  240    353.    383.   2
  1_E2_03   1 241  360    383.    413.   2
  1_E2_04   1 361  480    413.    443.   2
  1_E2_05   1 481  600    443.    473.   2
  1_E2_06   1 601  720    473.    503.   2

Unmerged data sections: non-interleaved

These are much simpler than the interleaved ones, so only one of the three is shown. This example illustrates how a chunk of unusable images from the middle of the scan is represented.

data_151759_2

 _pdbx_diffrn_data_section.id              '151759_2'
 _pdbx_diffrn_data_section.type_scattering 'x-ray'
 _pdbx_diffrn_data_section.type_merged     'false'
 _pdbx_diffrn_data_section.type_scaled     'true'


# This category may be better named 'pdbx_diffrn_data_section_image_list'
# _pdbx_diffrn_merge_image_list.data_section_id   # _item.mandatory_code should be 'implicit'

loop_
_pdbx_diffrn_merge_image_list.scan_id             # substitute for crystal id
_pdbx_diffrn_merge_image_list.image_id_begin
_pdbx_diffrn_merge_image_list.image_id_end
  2_01   1  443
  # Images 444-491 inclusive are unusable and have been excluded from processing
  2_01 492  720
  
_pdbx_diffrn_scan.scan_id             2_01
_pdbx_diffrn_scan.crystal_id             1
_pdbx_diffrn_scan.image_id_begin         1
_pdbx_diffrn_scan.image_id_end         720
_pdbx_diffrn_scan.scan_angle_begin     323.
_pdbx_diffrn_scan.scan_angle_end       503.
_pdbx_diffrn_scan.wavelength_id          3
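
Because only the processed ranges are listed, the excluded images have to be inferred from the gaps. A minimal sketch of that inference, with a hypothetical function name and plain tuples standing in for the image list rows:

```python
def excluded_ranges(scan_begin, scan_end, processed):
    """Return the image ranges of a scan not covered by any processed range.

    processed is a list of (image_id_begin, image_id_end) pairs, both
    inclusive, as in the image list category.
    """
    gaps = []
    expected = scan_begin  # next image we expect to see processed
    for first, last in sorted(processed):
        if first > expected:
            gaps.append((expected, first - 1))
        expected = max(expected, last + 1)
    if expected <= scan_end:
        gaps.append((expected, scan_end))
    return gaps


# Scan 2_01 covers images 1-720; two ranges were processed.
print(excluded_ranges(1, 720, [(1, 443), (492, 720)]))
```

Applied to the first interleaved data section, the same function reports images 1 and 720 as the excluded ones.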

N.B. Please make any comments on this proposal at the linked GitHub issue.
