@pkeller
Last active October 19, 2020 14:33
Archiving the final anisotropic data processing statistics in mmCIF

Archiving the final anisotropic data processing statistics in PDBx/mmCIF

In what follows, proposed new category and item names are shown with a prefix gphl.

Some of the changes suggested below also apply to isotropic data processing, but they arose from considering anisotropic cases, where they are particularly important.

Completeness

The mmCIF dictionary currently has one measure of completeness. For an entire data set it is _reflns.percent_possible_obs and for a resolution-defined shell it is _reflns_shell.percent_possible_obs. These items refer to data within two cut-off surfaces that are defined by low and high resolution limits, and are therefore spherical. An anisotropic treatment of the data requires the completeness to be calculated taking into account a non-spherical cut-off surface. Unobserved data outside the cut-off surface are not taken into account when calculating anisotropic completeness, so the resulting values differ from the conventional isotropic completeness.
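As an illustration of the difference between the two measures, here is a minimal Python sketch. The `Reflection` class, the `percent_complete` function and the toy `aniso_cutoff` predicate are hypothetical names invented for this example. A real cut-off surface would come from the data processing program, not from a rule on the Miller indices.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Reflection:
    hkl: Tuple[int, int, int]
    d_spacing: float  # resolution of the reflection, in Angstrom
    observed: bool    # True if the unique reflection was measured


def percent_complete(
    reflections: Iterable[Reflection],
    inside: Callable[[Reflection], bool],
) -> float:
    """Percentage of the possible reflections inside a cut-off that were observed.

    Reflections outside the cut-off are ignored whether observed or not.
    """
    possible = [r for r in reflections if inside(r)]
    if not possible:
        return 0.0
    return 100.0 * sum(r.observed for r in possible) / len(possible)


def spherical_cutoff(d_min: float, d_max: float) -> Callable[[Reflection], bool]:
    """Conventional cut-off: two spheres defined by resolution limits."""
    return lambda r: d_min <= r.d_spacing <= d_max


def aniso_cutoff(r: Reflection) -> bool:
    """Toy non-spherical cut-off (assumption): weaker diffraction when l != 0."""
    return r.d_spacing >= (2.0 if r.hkl[2] != 0 else 1.5)


REFLECTIONS = [
    Reflection((1, 0, 0), 3.0, True),
    Reflection((2, 0, 0), 1.6, True),
    Reflection((0, 0, 1), 3.0, True),
    Reflection((0, 0, 2), 1.6, False),  # unobserved, outside the toy surface
]

ISO = percent_complete(REFLECTIONS, spherical_cutoff(1.5, 50.0))
ANISO = percent_complete(REFLECTIONS, aniso_cutoff)
```

With these four toy reflections the spherical completeness is 75% and the anisotropic completeness is 100%, because the one unobserved reflection lies outside the toy anisotropic cut-off.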

We also note that the current PDBx dictionary does not seem to have items for anomalous completeness.

In order to allow for all four possible types of completeness to be archived, we propose extending the reflns category with the following three items:

  • _reflns.gphl_percent_possible_obs_anomalous
  • _reflns.gphl_percent_possible_obs_aniso
  • _reflns.gphl_percent_possible_obs_aniso_anomalous

and the reflns_shell category with the following three items:

  • _reflns_shell.gphl_percent_possible_obs_anomalous
  • _reflns_shell.gphl_percent_possible_obs_aniso
  • _reflns_shell.gphl_percent_possible_obs_aniso_anomalous

Data on the anisotropy

We suggest creating a new category to hold data that relates specifically to the degree of anisotropy, and linking it to the reflns category. The STARANISO output includes the following:

Diffraction limits & principal axes of ellipsoid fitted to diffraction cut-off surface:

                              1.887         0.9873   0.1573  -0.0204       0.950 _a_* + 0.305 _b_* + 0.065 _c_*
                              1.489        -0.1065   0.7526   0.6498      -0.075 _a_* + 0.679 _b_* + 0.730 _c_*
                              1.569         0.1176  -0.6394   0.7599       0.097 _a_* - 0.676 _b_* + 0.730 _c_*

Columns 2-4 are the axis directions of the ellipsoid expressed as an orthonormal set

    Fraction of data inside cut-off surface:        82.2%  (    45658 /    55540)
    
    Fraction of cut-off surface above threshold:    43.4%  (    1270 /    2927)
Beq:                               19.65    [ = equivalent overall isotropic B factor on Fs.]

                                               B11      B22      B33      B23      B31      B12
Delta-B tensor:                              12.10    -6.43    -5.67     2.44     0.86     6.20

This could be represented in PDBx/mmCIF as:

_gphl_reflns_aniso.reflns_pdbx_ordinal    1       # child of _reflns.pdbx_ordinal
_gphl_reflns_aniso.diffrn_limit_1          1.887
_gphl_reflns_aniso.diffrn_limit_2          1.489
_gphl_reflns_aniso.diffrn_limit_3          1.569

_gphl_reflns_aniso.axis_1_ortho[1]  0.9873
_gphl_reflns_aniso.axis_1_ortho[2]  0.1573
_gphl_reflns_aniso.axis_1_ortho[3] -0.0204
_gphl_reflns_aniso.axis_2_ortho[1] -0.1065
_gphl_reflns_aniso.axis_2_ortho[2]  0.7526
_gphl_reflns_aniso.axis_2_ortho[3]  0.6498
_gphl_reflns_aniso.axis_3_ortho[1]  0.1176
_gphl_reflns_aniso.axis_3_ortho[2] -0.6394
_gphl_reflns_aniso.axis_3_ortho[3]  0.7599


_gphl_reflns_aniso.axis_1_rcell[1]  0.950
_gphl_reflns_aniso.axis_1_rcell[2]  0.305
_gphl_reflns_aniso.axis_1_rcell[3]  0.065
_gphl_reflns_aniso.axis_2_rcell[1] -0.075
_gphl_reflns_aniso.axis_2_rcell[2]  0.679
_gphl_reflns_aniso.axis_2_rcell[3]  0.730
_gphl_reflns_aniso.axis_3_rcell[1]  0.097
_gphl_reflns_aniso.axis_3_rcell[2] -0.676
_gphl_reflns_aniso.axis_3_rcell[3]  0.730

_gphl_reflns_aniso.b[1][1]   31.75
_gphl_reflns_aniso.b[2][2]   13.22
_gphl_reflns_aniso.b[3][3]   13.98
_gphl_reflns_aniso.b[2][3]    2.44
_gphl_reflns_aniso.b[3][1]    0.86
_gphl_reflns_aniso.b[1][2]    6.20

_gphl_reflns_aniso.percent_data_inside_cutoff       82.2
_gphl_reflns_aniso.percent_cutoff_above_threshold   43.4

where we archive the absolute B tensor, rather than Beq and the delta-B tensor.
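The relationship between the two representations can be sketched as follows. This assumes that the delta-B tensor is the traceless deviation of the absolute tensor from Beq times the identity matrix; the output above does not state this explicitly. The function name and the dictionary keys are hypothetical.

```python
# Values copied from the STARANISO output quoted above.
B_EQ = 19.65
DELTA_B = {"11": 12.10, "22": -6.43, "33": -5.67,
           "23": 2.44, "31": 0.86, "12": 6.20}


def absolute_b_tensor(b_eq: float, delta_b: dict) -> dict:
    """Return B = Beq * I + delta-B (assumed convention).

    Only the diagonal elements are shifted by Beq, because the identity
    matrix has no off-diagonal elements.
    """
    diagonal = {"11", "22", "33"}
    return {
        key: round(value + b_eq, 2) if key in diagonal else value
        for key, value in delta_b.items()
    }


B_ABS = absolute_b_tensor(B_EQ, DELTA_B)

# Sanity check on the assumption: a traceless delta-B has zero trace.
DELTA_TRACE = DELTA_B["11"] + DELTA_B["22"] + DELTA_B["33"]
```

The delta-B values quoted above have zero trace (to within rounding), which is consistent with this assumption. Under it, only the diagonal elements differ between the two representations.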

All these items could be incorporated directly into the reflns category, if that is thought to be a better solution.
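To illustrate how the three diffraction limits and the orthonormal axis directions listed above describe the fitted ellipsoid, here is a minimal Python sketch. It assumes that a reciprocal-space vector is given in the same orthonormal frame as the axes, in units of 1/Å, and that each semi-axis of the ellipsoid has length 1/d for the corresponding limit d. The function names are hypothetical. The sketch tests against the fitted ellipsoid only, which is an approximation to the actual cut-off surface.

```python
import math

# Diffraction limits (Angstrom) and principal axes from the example above.
LIMITS = (1.887, 1.489, 1.569)
AXES = (
    (0.9873, 0.1573, -0.0204),
    (-0.1065, 0.7526, 0.6498),
    (0.1176, -0.6394, 0.7599),
)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def directional_limit(direction):
    """Resolution limit (Angstrom) of the ellipsoid along an arbitrary direction."""
    norm = math.sqrt(dot(direction, direction))
    unit = tuple(x / norm for x in direction)
    return math.sqrt(sum((dot(unit, e) * d) ** 2 for e, d in zip(AXES, LIMITS)))


def inside_ellipsoid(s):
    """True if the reciprocal-space vector s (1/Angstrom) lies within the ellipsoid."""
    return sum((dot(s, e) * d) ** 2 for e, d in zip(AXES, LIMITS)) <= 1.0


# Two vectors of the same length (0.6 1/Angstrom, i.e. d = 1.667 Angstrom)
# along the first and second principal axes.
ALONG_AXIS_1 = tuple(0.6 * x for x in AXES[0])
ALONG_AXIS_2 = tuple(0.6 * x for x in AXES[1])
```

With the example values, `ALONG_AXIS_1` falls outside the ellipsoid (the limit in that direction is 1.887 Å) while `ALONG_AXIS_2` falls inside it (limit 1.489 Å), although both correspond to the same resolution.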

Reflection redundancy

There is currently no way of representing the individual redundancy of merged reflections. We propose extending the refln category with three new items to cater for this:

For centric reflections, only the first of these three items would be populated.

Including these redundancies would allow better interpretation of individual σ(I) values, and improved visualisation of the effects of detector module gaps, shadowing and cusps (even for data from multi-orientation data collections).

Reflection binning by statistical significance

STARANISO uses the local mean I/σ(I) as a measure of statistical significance, and calculates this value for each reflection. A cut-off surface of arbitrary shape is then defined by a threshold value of this local mean I/σ(I) (for more details, see the STARANISO documentation). Views of the reciprocal lattice, binned and coloured by statistical significance, are then produced by:

  • the WebGL viewer on the STARANISO server (in 3D)
  • autoPROC (as 2D plots of key projections)

An example of a 2D plot is:

[Figure: p0r plot from reprocessing JCSG images for 4IB2]

We propose to archive the measure of statistical significance for each reflection, with an associated status to aid interpretation.

To allow the 2D and 3D plots to be reproduced (or other equivalent views to be generated) the binning determined by STARANISO is also required. Our suggestion on how to do this is as follows:

_gphl_refln_signal.criterion 'local(mean(I/sigI))'

loop_
_gphl_refln_signal_bin.upper_threshold
# The first threshold defines the lower limit of statistical
# significance, i.e. the cut-off surface
  1.20
  7.22
 19.00
 36.81
 48.87
 53.84
 57.69
 
loop_
_refln.index_h
_refln.index_k
_refln.index_l
_refln.gphl_signal_status
_refln.gphl_signal
# ... other items omitted
    1  -37    0    -    .         # Unobservable (grey)
    2  -33    0    <    .         # Observable but unmeasured (blue)
    1  -32    0    o    3.38      # Observed, with associated signal (red, orange, etc.)
   19   18    0    x    0.23      # Observed, but with high individual I/sig(I) (pink)

In this example, the proposed item _refln.gphl_signal_status borrows some of its controlled vocabulary from _refln.status.
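A consumer of these items could assign each reflection to a bin along the following lines. This is a minimal Python sketch, not part of the proposal: the function names are hypothetical, and the treatment of values exactly on a threshold or above the last threshold is an assumption.

```python
from bisect import bisect_right

# Thresholds from the _gphl_refln_signal_bin loop above.
UPPER_THRESHOLDS = [1.20, 7.22, 19.00, 36.81, 48.87, 53.84, 57.69]

# Meanings taken from the comments in the example above.
STATUS_MEANING = {
    "-": "unobservable",
    "<": "observable but unmeasured",
    "o": "observed, with associated signal",
    "x": "observed, but with high individual I/sig(I)",
}


def parse_row(line):
    """Split one row of the refln loop into (hkl, status, signal).

    A signal of '.' (the mmCIF 'not applicable' value) becomes None.
    """
    h, k, l, status, signal = line.split("#")[0].split()
    value = None if signal == "." else float(signal)
    return (int(h), int(k), int(l)), status, value


def signal_bin(signal):
    """Index of the bin for a signal value, or None if there is no signal.

    Bin 0 lies below the first threshold, i.e. outside the cut-off surface.
    Assumption: a value equal to a threshold belongs to the bin above it,
    and values above the last threshold get index len(UPPER_THRESHOLDS).
    """
    if signal is None:
        return None
    return bisect_right(UPPER_THRESHOLDS, signal)


HKL, STATUS, SIGNAL = parse_row("    1  -32    0    o    3.38      # Observed")
BIN = signal_bin(SIGNAL)
```

For the third row of the example, the signal of 3.38 lies between the first two thresholds, so the reflection falls in bin 1, the lowest bin inside the cut-off surface.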

Goodness of fit

The analysis implemented by STARANISO entails fitting an analytical function to the statistical significance: currently this fit is of an ellipsoid to the boundary between statistically significant and non-significant regions of reciprocal space. In the future, we would also like to archive a measure of the goodness of fit. We are currently considering how best to do this.

@drlemmus

I really like the per-reflection redundancy/multiplicity.
If I understand correctly, _refln.gphl_signal can be calculated on _gphl_refln_signal_bin.upper_threshold and _refln.intensity_meas plus _refln.intensity_sigma. That would make it not primary data and therefore unnecessary to store.

@pkeller

pkeller commented Apr 20, 2020

> I really like the per-reflection redundancy/multiplicity.
> If I understand correctly, _refln.gphl_signal can be calculated on _gphl_refln_signal_bin.upper_threshold and _refln.intensity_meas plus _refln.intensity_sigma. That would make it not primary data and therefore unnecessary to store.

Thanks for the comment. Bear in mind that the signal is currently an application-specific quantity, so it cannot easily be recalculated in the general case. This example uses the STARANISO-calculated signal, which is not the I/sig(I) of an individual reflection but the mean of I/sig(I) over a set of reflections that are local to that reflection. This is the information contained in _gphl_refln_signal.criterion.

@drlemmus

If it cannot be reproduced straightforwardly and is application-specific, then there is a good reason to define a means of storing this. At the same time, this raises questions on whether we would want to store it in the PDB databank (devil's advocate).

@pkeller

pkeller commented Apr 21, 2020

The signal is used by STARANISO to define the cut-off surface. (Note that the ellipsoid is not the cut-off surface, it is just the closest ellipsoidal fit to that cut-off surface). The problem with the argument that application-specific data should not be archived is this: when applications using new methods are developed (like STARANISO), how are you going to archive the data and parameters associated with their use if you are forced to express those data/parameters in the same terms as pre-existing applications that do things in a different way?

This proposal at least makes some attempt at generality. The per-reflection statistic is called _refln.gphl_signal, not _refln.gphl_local_mean_I_over_sigI, so could be used by other applications in the future that calculate a per-reflection significance. The nature of that significance can be specified with the _gphl_refln_signal.criterion item.

@drlemmus

I agree that we sometimes need to store new things and it is very good to describe them properly in a new way if they are fundamentally different from what was used before. I think that all bases are covered in the proposal. I would propose a fixed dictionary for _gphl_refln_signal.criterion to ensure it stays machine readable.

My point was not about forcing this data into a non-fitting form (we should avoid this), it is about deciding whether or not we want to store it. A pro is that it is very useful to have, a con could be that it can be calculated on-the-fly with only the raw data and some limited meta data.

@pkeller

pkeller commented Apr 21, 2020

> it can be calculated on-the-fly with only the raw data and some limited meta data.

Can it? This opens up a whole can of worms. To reproduce the values used by the depositor, the PDB would need to run STARANISO in the same way as the depositor did (which may be from within autoPROC or via the STARANISO server). In the future if the calculation of the signal is tweaked, they may also need to match the version of STARANISO that the depositor used. I don't think that this is practical. These values are not just data quality statistics - they are part of the definition of the data that go forward into refinement.

@GB-GPhL

GB-GPhL commented Oct 19, 2020

I strongly agree with the viewpoint you express in your last comment, Peter, and especially in its last sentence.

What needs to be archived comprises
(1) data quality metrics that will enable PDB searches to filter entries by a broader range of relevant data-centric criteria than are available now (which would open vast new horizons, given how lame the data quality criteria currently archived are), and
(2) a summary of the data analysis criteria (that may be more elaborate and more program-specific than the archived ones) on the basis of which decisions were made in selecting the subset of those raw measurements eventually used as input to the refinement that produced the deposited model.

Note that the overall anisotropic B tensor determined by STARANISO as the first step in its analysis should be included in category (1), as it is a direct description of the anisotropy of the data that is purely based on Wilson statistics reshaped by |E|^2 profile information (and hence essentially model-independent), and is therefore distinct from both
(a) the recasting of the anisotropy in the form of distinct diffraction limits in different directions (as these can be influenced by redundancy and systematic incompleteness) and
(b) the anisotropic scaling B tensor coming out of model refinement.
Last but not least, it is also the source of the information used in the STARANISO anisotropy correction, i.e. the optional sharpening of the data in the weaker directions, applied to the final processed data prior to their input to refinement (something also done by the UCLA Diffraction Anisotropy Server).
