This is a github gist that contains scripts, made available for everyone to reproduce data processing in the publication.
In deposited datasets, organization is the following:
C2_L_C2221
+-- 001_01_01
+-- 002_01_02
| +-- images
| | +-- 002_01_02_0[0-100].cbf
| | +-- x_geo_corr.cbf
| | +-- y_geo_corr.cbf
| +-- XDS.INP
| +-- XDS.INP.modified
| +-- crystallization.txt
+-- 003_02_01
...
+-- express.py
+-- fin.csv
+-- reject.sh
+-- create_express_inp.py
+-- xdscc.py
+-- xdscc12
+-- rating.py
+-- XSCALE.express.py.INP
Each folder is numbered as XXX_YY_ZZ_NN, where:
- XXX -- consequtive number in the datasest
- YY -- crystallization conditions ID
- ZZ -- loop number withing same crystallization conditinos
- NN -- number of miniset within loop
create_xscale_inp.py
-- creates fixed-name fin.csv
table, implying folder structure as described above. The table is then used by express.py
as input file. Each line in fin.csv
represents one miniset. It contains following columns (in that order): name of the folder to be created by express.py
for all XDS-related files in this datasets, location of XDS.INP
file for this dataset, location of raw data for this dataset, number of images in this dataset.
Usage:
python create_xscale_inp.py
For the first integration, one has to have input file fin.csv
ready (see above) and filtered, if necessary. Only the scripts express.py
and reject.sh
are necessary.
express.py
-- Given a list of folders with XDS.INP and path to respective data sets, the script runs XDS for all the data sets in list, optionally adding UNIT_CELL_CONSTANTS, SPACE_GROUP_NUMBER, INCLUDE_RESOLUTION_RANGE, setting SPOT_RANGE same as DATA_RANGE, and setting REFERENCE_DATA_SET. Adds MAXIMUM_NUMBER_OF_PROCESSORS and MAXIMUM_NUMBER_OF_JOBS for processing on large clusters. Runs xscale_par afterwards.
data_summary_table = 'fin.csv'
space_group = '!SPACE_GROUP_NUMBER= 1 \n'
unit_cell_constants = '!UNIT_CELL_CONSTANTS= 30 40 50 90 90 90\n'
max_num_proc = 'MAXIMUM_NUMBER_OF_PROCESSORS= 80 \n'
max_num_jobs = 'MAXIMUM_NUMBER_OF_JOBS= 48 \n'
resolution_range = 'INCLUDE_RESOLUTION_RANGE= 30 3.0 \n'
#reference_data_set = 'REFERENCE_DATA_SET= %s \n'%'../reference.HKL'
reference_data_set = '!REFERENCE_DATA_SET= %s \n'%'../reference.HKL'
use_reference = True
Input parameters are to set by manually editing the script. If one want to use REFERENCE_DATA_SET
keyword during integration, uncomment line with REFERENCE_DATA_SET
in code and comment prefious one. By default, file reference.HKL
in folder with express.py
is used as reference.
Usage:
# for short data processing runs
python express.py
# for long run, where you may want to log off from the processing server and keep log file
python express.py |& tee log.express.py_$(date "+%Y_%m_%d_%H_%M") & disown
# to kill mistakinly started processing
pkill -f express.py; pkill -f xds_par; pkill -f xscale_par
reject.sh
Merges all XDS_ASCII.HKL files in subfolders of current folder, optionally choosing those that have particular space group. Then iteratively runs deltaCC12 rejection with given resolution range and number of cycles. Saves all intermediate XSCALE.INP-s and XSCALE.LP-s.
Usage:
bash reject.sh
using default configuration (only part of the script is shown):
# will run 4 cycles of rejection
for i in `seq 1 1 4`; do
xscale_par
cp XSCALE.LP{,_$i}
# will make deltaCC rejection in 32.0-10.0 resolutoin range with 5 bins
./xdscc12 scaled_nonmerged.HKL -dmin 32.0 -dmax 10 -nbin 5 > XDSCC.LP
# will analyse output file XDSCC.LP
# and write to good.xdscc names of only those minisets,
# which have deltaCC12 higher then 2.0
python xdscc.py XDSCC.LP 2.0 |& tee log.xdscc_"$i"
Optionally, you may want to inspect XSCALE.LP tables for all datasets and take the best one as reference or for further processing:
grep 'Nano' -A 25 XSCALE.LP_* | less
# choose best one, e.g. XSCALE.LP_2
cp xscale.inp_2 XSCALE.INP
xscale_par; cp scaled_nonmerged.HKL reference.HKL
You can also take only minisets with certain space group as initial input for further deltaCC rejection:
# comment this line
# ls */XDS_ASCII.HKL > xscale.inp
# and uncomment this -- here the space group is in first `grep`, number 22
grep SPACE_GROUP_NUMBER */XDS_ASCII.HKL | grep "22$" | tr ":" " " | awk '{print $1}' > xscale.inp
xdscc.py
Analyses output of xdscc utility together with last XSCALE.INP used, providing the list of datasets with their deltaCC12 values. Saves listgood.xdscc
of those which have deltaCC12 higher than input value.
Usage:
# run xdscc12, which is executable and located in current folder
# using scaled_nonmerged file (produced with XSCALE using MERGE=FALSE)
# in resolution range 40.0-2.8 and 13 bins
./xdscc12 scaled_nonmerged.HKL -dmin 40.0 -dmax 2.8 -nbin 13 > XDSCC.LP
# analyse output file XDSCC.LP, produced by xdscc12
# and write go good.xdscc only filenames with
# deltaCC > 3.0
python xdscc.py XDSCC.LP 3.0
This will give output of following kind:
1 056_03_02_01/XDS_ASCII.HKL -5.75
3 101_05_02_05/XDS_ASCII.HKL -0.86
-- -------------------------- 0.00
4 106_05_03_04/XDS_ASCII.HKL 0.11
7 109_05_03_07/XDS_ASCII.HKL 0.35
5 107_05_03_05/XDS_ASCII.HKL 1.23
10 112_05_03_10/XDS_ASCII.HKL 1.60
8 110_05_03_08/XDS_ASCII.HKL 3.31
12 191_13_01_01/XDS_ASCII.HKL 4.35
9 111_05_03_09/XDS_ASCII.HKL 4.40
11 113_05_03_11/XDS_ASCII.HKL 4.69
6 108_05_03_06/XDS_ASCII.HKL 7.66
13 196_13_02_03/XDS_ASCII.HKL 8.45
15 204_18_01_01/XDS_ASCII.HKL 10.39
2 090_04_02_02/XDS_ASCII.HKL 11.33
14 203_17_01_01/XDS_ASCII.HKL 24.51
and write following good.xdscc
:
110_05_03_08/XDS_ASCII.HKL
191_13_01_01/XDS_ASCII.HKL
111_05_03_09/XDS_ASCII.HKL
113_05_03_11/XDS_ASCII.HKL
108_05_03_06/XDS_ASCII.HKL
196_13_02_03/XDS_ASCII.HKL
204_18_01_01/XDS_ASCII.HKL
090_04_02_02/XDS_ASCII.HKL
203_17_01_01/XDS_ASCII.HKL
For second integration, you usually update express.py
to have initial unit cell constants as in your reference dataset, and also increase resolution range (if you see that your reference data set has potential for it):
express.py
grep '^!UNIT_CELL_CONSTANTS' reference.HKL
>UNIT_CELL_CONSTANTS= 59.22 45.66 86.77 90.000 91.275 90.000
grep SPACE_GROUP_NUMBER reference.HKL
>!SPACE_GROUP_NUMBER= 4
Modify express.py
input parameters:
data_summary_table = 'fin.csv'
space_group = 'SPACE_GROUP_NUMBER= 4 \n'
unit_cell_constants = 'UNIT_CELL_CONSTANTS= 59.22 45.66 86.77 90.000 91.275 90.000\n'
max_num_proc = 'MAXIMUM_NUMBER_OF_PROCESSORS= 80 \n'
max_num_jobs = 'MAXIMUM_NUMBER_OF_JOBS= 48 \n'
resolution_range = 'INCLUDE_RESOLUTION_RANGE= 30 2.5 \n'
And run it:
python express.py |& tee log.express.py_$(date "+%Y_%m_%d_%H_%M") & disown
After the second run, you might assume that you have your minisets in the best quality possible, and you can start merging them in the best possible way.
reject.sh
You may add several cycles of deltaCC rejection in various resolution ranges -- e.g. perform low-resolution rejection first (to get rid of non-isomorphous data), and then high-resolution second (to improve your resolution):
# first cycle -- resolution range 30.0-10.0, 10 bins, 5.0 deltaCC cutoff
for i in `seq 1 1 5`; do
# part of the code omitted
./xdscc12 scaled_nonmerged.HKL -dmin 30.0 -dmax 10.0 -nbin 10 > XDSCC.LP
python xdscc.py XDSCC.LP 5.0 |& tee log.xdscc_"$i"
# part of the code omitted
...
done
# second cycle -- resolution range 5.0-2.5, 23 bins, 1.0 deltaCC cutoff
for i in `seq 6 1 10`; do
# part of the code omitted
./xdscc12 scaled_nonmerged.HKL -dmin 5.0 -dmax 2.5 -nbin 23 > XDSCC.LP
python xdscc.py XDSCC.LP 1.0 |& tee log.xdscc_"$i"
# part of the code omitted
...
done
REIDX
For some data sets, you may want to run reindexing (this is the case forC2_S_I4
dataset). To enable further deltaCC rejection, make sure you write your re-indexed datasets (run XSCALE withMERGE=FALSE
) asXDS_ASCII.HKL
in corresponding folder.
For primer in SFX data processing, please relate to original CrystFEL tutorial. Here, we discuss high-level wrappers used during the processing.
In deposited datasets, organization is the following:
C1_Zaf_P1
+-- raw_data
| +-- r0126-cyslt1-zaf
| +-- r0127-cyslt1-zaf
| | +-- cxilq5415-r0133-c00.cxi
+-- streams
| +-- c1_zaf_p1_2019_04_29_10_27_19.stream
| +-- c1_zaf_p1_2019_04_29_17_49_58.stream
+-- logs
+-- scratch
+-- initial.geom
+-- run_crystfel.sh
+-- analyse.sh
+-- c1_zaf.cell
+-- laststream -> streams/c1_zaf_p1_2019_04_29_17_49_58.stream
Folders scratch
, logs
and streams
are necessary for running analyse.sh
, please make sure you have them (do mkdir scratch; mkdir streams; mkdir logs
before running it). Note that laststream
points to the most recent stream for further analysis convenience.
find
Locate all input files (either*.h5
or*.cxi
, in publication only*.cxi
is the case) in your subfolders:
Usage:
find -name *.cxi > cxi.lst
list_events
Convert several-events-per-line inputcxi.lst
file to one-event-per-line inputcxi_event.lst
file (needs proper geometry file, which is provided in each deposition):
Usage:
list_events -i cxi.lst -o cxi-events.lst -g refined.geom
run_crystfel.sh
Wrapper for indexamajig routine, that i) arranges all crystfel-related files into subfolders ii) automatically assigns date and time for each generated stream and recpective log file iii) links last created stream tolaststream
link, and shuffles input file list, so that one could quickly and reliably check indexing rate before the indexing finishes.
# prefix for all *.stream files in streams folder
PROJECT_NAME="c1_zaf_p1"
# number of cores used for processing
NPROC="95"
# PEAK FINDING PARAMETERS (see `man indexamajig`)
SNR='4.5'
THRESHOLD='210'
HIGHRES='2.5'
LST='cxi-events.lst'
CELL='c1-zaf.cell'
shuf "$LST" > input.lst # your list must have events to enable this
GEOM="initial.geom"
ln -f -s "streams/"$PROJECT_NAME"_${time}.stream" laststream
indexamajig -i input.lst \
--temp-dir=scratch \
-o "streams/"$PROJECT_NAME"_${time}.stream" \
\
-g "$GEOM" \
--peaks=peakfinder8 \
-j "$NPROC" \
--min-snr="$SNR" \
--threshold="$THRESHOLD" \
--highres="$HIGHRES" \
\
-p "$CELL" \
--check-peaks \
\
--indexing=felix,dirax,asdf,mosflm,xds,taketwo |& tee logs/log.indexamajig_${time}
Usage:
# short run without logging off
bash run_crystfel.sh
# long background run
bash run_crystfel.sh & disown
analysis.sh
Wrapper for process_hkl, partialator, check_hkl and compare_hkl routines, which produces XSCALE.LP-like statistics table, counts images indexed with different indexers, produces command-line visible histogram of image resolution (for simple estimation of push-res parameter) and writes logs.
# Indexing analysis only:"
./analysis.sh -i laststream"
# Merging with process_hkl and analysis:"
./analysis.sh -i laststream --dorate 0 -j 96 --cell c1-zaf.cell --pushres 1.8 -s '-1' --highres 2.53
# Merging with partialator and analysis:
./analysis.sh -i laststream --dorate 1 -j 96 --cell c1-zaf.cell --pushres 1.8 -s '-1' --highres 2.53 --iterations 1 --model unity
Analysis of input stream will provide text histogram of resolution, estimated by crystfel, and info about all indexers success:
=================
Indexing details:
=================
.046 1986 asdf-nolatt-cell
.168 7224 dirax-nolatt-nocell
.389 16717 felix-latt-cell
.002 121 mosflm-latt-cell
.057 2470 taketwo-latt-cell
0 10 xds-latt-cell
=================
Indexing summary:
=================
Total number of images for processing: 43417
Number of processed images: 42907
Number of indexed: 28528
Number of crystals: 28900
Number of spots found: 2244193
Image indexing rate: .66
Crystals percentage: .67
Average crystals per image: 1.01
If merging was performed, following XSCALE.LP-like table will be written:
Center 1/nm # refs Possible Compl Meas Red SNR Std dev Mean d(A) Min 1/nm Max 1/nm Rsplit/% CC CC*
1.086 3036 3036 100.00 501807 165.3 10.75 6043.70 4365.55 9.21 0.333 1.838 8.02 0.9914373 0.9978478
2.076 3028 3028 100.00 309303 102.1 7.94 3191.70 3226.50 4.82 1.838 2.313 12.09 0.9720119 0.9928783
2.480 2992 2992 100.00 236952 79.2 4.85 2389.33 1890.02 4.03 2.313 2.647 19.67 0.9567785 0.9888943
2.780 3037 3037 100.00 175450 57.8 2.74 1122.50 866.10 3.60 2.647 2.913 38.39 0.8656982 0.9633355
3.025 3041 3041 100.00 176418 58.0 1.92 644.22 483.40 3.31 2.913 3.138 56.76 0.7393615 0.9220373
3.236 3020 3020 100.00 156472 51.8 1.18 468.81 272.44 3.09 3.138 3.334 98.64 0.5750552 0.8545193
3.422 3037 3037 100.00 132063 43.5 0.72 365.36 170.59 2.92 3.334 3.510 171.59 0.3274640 0.7024014
3.590 3043 3043 100.00 128703 42.3 0.61 360.10 143.10 2.79 3.510 3.669 211.22 0.2679825 0.6501470
3.743 3004 3004 100.00 116843 38.9 0.43 364.12 109.43 2.67 3.669 3.816 299.75 0.1926646 0.5684036
3.884 3063 3063 100.00 94652 30.9 0.31 385.49 79.01 2.57 3.816 3.953 487.25 0.0681203 0.3571438
-------------------------------------------------------------------------------------------------------------------------------------------------------
2.143 30301 30301 100.00 2028663 67.0 3.14 2749.07 1159.24 4.67 0.333 3.953 28.23 0.9739452 0.9933784