marinegor/README.md Secret

## README.md

      
    Small-wedge synchrotron and serial XFEL datasets for Cysteinyl leukotriene GPCRs

This GitHub gist contains the scripts needed to reproduce the data processing described in the publication.
SSX

General folder structure

Deposited datasets are organized as follows:
C2_L_C2221
+-- 001_01_01
+-- 002_01_02
|   +-- images
|   |   +-- 002_01_02_0[0-100].cbf
|   |   +-- x_geo_corr.cbf
|   |   +-- y_geo_corr.cbf
|   +-- XDS.INP
|   +-- XDS.INP.modified
|   +-- crystallization.txt
+-- 003_02_01
...
+-- express.py
+-- fin.csv
+-- reject.sh
+-- create_express_inp.py
+-- xdscc.py
+-- xdscc12
+-- rating.py
+-- XSCALE.express.py.INP

Each folder is numbered as XXX_YY_ZZ_NN, where:

XXX -- consecutive number in the dataset
YY -- crystallization conditions ID
ZZ -- loop number within the same crystallization conditions
NN -- number of the miniset within the loop
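
The naming scheme above can be unpacked programmatically, for example to group minisets by crystallization condition. The following is a minimal sketch, not part of the deposited scripts; the names `MinisetId` and `parse_miniset_name` are hypothetical, and it assumes four purely numeric fields separated by underscores (the three-field names shown in the folder tree above would be rejected).

```python
from typing import NamedTuple


class MinisetId(NamedTuple):
    number: int     # XXX: consecutive number in the dataset
    condition: int  # YY: crystallization conditions ID
    loop: int       # ZZ: loop number within the same conditions
    miniset: int    # NN: number of the miniset within the loop


def parse_miniset_name(folder: str) -> MinisetId:
    """Split a folder name of the form XXX_YY_ZZ_NN into its four fields."""
    parts = folder.strip("/").split("_")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError("expected XXX_YY_ZZ_NN, got %r" % folder)
    return MinisetId(*(int(p) for p in parts))


print(parse_miniset_name("056_03_02_01"))
```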

Preparation of input files

create_express_inp.py -- creates the fixed-name table fin.csv, assuming the folder structure described above. The table is then used by express.py as its input file. Each line in fin.csv represents one miniset. It contains the following columns (in that order): name of the folder to be created by express.py for all XDS-related files of this dataset, location of the XDS.INP file for this dataset, location of the raw data for this dataset, number of images in this dataset.
Usage:
python create_express_inp.py
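
To sanity-check a fin.csv before starting a long run, the rows can be read back and validated. This is a hedged sketch rather than the authors' code: it follows the way express.py in this gist unpacks each row (five comma-separated fields: name, data, inp, data_range_start, data_range_stop), the function name `read_fin` is hypothetical, and the paths in the example row are made up.

```python
import csv
import io

# Field names mirror the variables that express.py unpacks from each row.
FIELDS = ["name", "data", "inp", "data_range_start", "data_range_stop"]


def read_fin(handle):
    """Return one dict per miniset; raise on rows with the wrong field count."""
    rows = []
    for lineno, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines, e.g. a trailing newline
        if len(row) != len(FIELDS):
            raise ValueError("line %d: expected %d fields, got %d"
                             % (lineno, len(FIELDS), len(row)))
        rows.append(dict(zip(FIELDS, row)))
    return rows


# Hypothetical example row; real paths depend on where the data were unpacked.
example = io.StringIO("002_01_02,/data/002_01_02,/data/002_01_02/images,1,100\n")
for entry in read_fin(example):
    print(entry["name"], entry["data_range_start"], entry["data_range_stop"])
```

For a real file, `read_fin(open("fin.csv"))` would be the equivalent call.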
Integration-1

The first integration requires the input file fin.csv (see above), filtered if necessary. Only the scripts express.py and reject.sh are needed.

express.py -- Given a list of folders with XDS.INP and path to respective data sets, the script runs XDS for all the data sets in list, optionally adding UNIT_CELL_CONSTANTS, SPACE_GROUP_NUMBER, INCLUDE_RESOLUTION_RANGE, setting SPOT_RANGE same as DATA_RANGE, and setting REFERENCE_DATA_SET. Adds MAXIMUM_NUMBER_OF_PROCESSORS and MAXIMUM_NUMBER_OF_JOBS for processing on large clusters. Runs xscale_par afterwards.

data_summary_table = 'fin.csv'
space_group         = '!SPACE_GROUP_NUMBER=  1 \n'
unit_cell_constants = '!UNIT_CELL_CONSTANTS=  30 40 50 90 90 90\n'
max_num_proc        = 'MAXIMUM_NUMBER_OF_PROCESSORS= 80  \n'
max_num_jobs        = 'MAXIMUM_NUMBER_OF_JOBS=       48  \n'
resolution_range    = 'INCLUDE_RESOLUTION_RANGE= 30 3.0 \n'

#reference_data_set = 'REFERENCE_DATA_SET= %s \n'%'../reference.HKL'
reference_data_set = '!REFERENCE_DATA_SET= %s \n'%'../reference.HKL'
use_reference = True
Input parameters are set by manually editing the script. To use the REFERENCE_DATA_SET keyword during integration, uncomment the line that sets REFERENCE_DATA_SET without the leading `!` and comment out the other one. By default, the file reference.HKL in the folder containing express.py is used as the reference.
Usage:
# for short data processing runs
python express.py
# for long run, where you may want to log off from the processing server and keep log file
python express.py |& tee log.express.py_$(date "+%Y_%m_%d_%H_%M") & disown
# to kill processing that was started by mistake
pkill -f express.py; pkill -f xds_par; pkill -f xscale_par

reject.sh
Merges all XDS_ASCII.HKL files in subfolders of current folder, optionally choosing those that have particular space group. Then iteratively runs deltaCC12 rejection with given resolution range and number of cycles. Saves all intermediate XSCALE.INP-s and XSCALE.LP-s.

Usage:
bash reject.sh using default configuration (only part of the script is shown):
# will run 4 cycles of rejection
for i in `seq 1 1 4`; do
    xscale_par
    cp XSCALE.LP{,_$i}
    
    # will make deltaCC rejection in 32.0-10.0 resolution range with 5 bins
    ./xdscc12 scaled_nonmerged.HKL -dmin 32.0 -dmax 10 -nbin 5 > XDSCC.LP
    
    # will analyse output file XDSCC.LP
    # and write to good.xdscc names of only those minisets,
    # which have deltaCC12 higher than 2.0
    python xdscc.py XDSCC.LP 2.0 |& tee log.xdscc_"$i"
Optionally, you may want to inspect XSCALE.LP tables for all datasets and take the best one as reference or for further processing:
grep 'Nano' -A 25 XSCALE.LP_* | less
# choose best one, e.g. XSCALE.LP_2
cp xscale.inp_2 XSCALE.INP
xscale_par; cp scaled_nonmerged.HKL reference.HKL
You can also take only minisets with certain space group as initial input for further deltaCC rejection:
# comment this line
# ls */XDS_ASCII.HKL > xscale.inp

# and uncomment this -- here the space group number (22) is set in the second `grep`
grep SPACE_GROUP_NUMBER */XDS_ASCII.HKL | grep "22$" | tr ":" " " | awk '{print $1}' > xscale.inp
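
The same selection can be sketched in Python with an exact comparison of the space group number (the `grep "22$"` pattern above would also match a number such as 122). This is an illustrative alternative, not part of reject.sh; the function names are hypothetical, and it assumes the header line starts with `!SPACE_GROUP_NUMBER=` and that the header ends at `!END_OF_HEADER`.

```python
import glob


def space_group_of(lines):
    """Return SPACE_GROUP_NUMBER from XDS_ASCII.HKL header lines, or None."""
    for line in lines:
        if line.startswith("!SPACE_GROUP_NUMBER="):
            return int(line.split("=", 1)[1].split()[0])
        if line.startswith("!END_OF_HEADER"):
            break  # reflection records follow; stop reading
    return None


def files_with_space_group(number, pattern="*/XDS_ASCII.HKL"):
    """List files matching the glob pattern whose header has this space group."""
    selected = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            if space_group_of(handle) == number:
                selected.append(path)
    return selected


if __name__ == "__main__":
    # Prints one path per line; redirect to xscale.inp if desired.
    print("\n".join(files_with_space_group(22)))
```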

xdscc.py
Analyses output of xdscc utility together with last XSCALE.INP used, providing the list of datasets with their deltaCC12 values. Saves list good.xdscc of those which have deltaCC12 higher than input value.

Usage:
# run xdscc12, which is executable and located in current folder
# using scaled_nonmerged file (produced with XSCALE using MERGE=FALSE)
# in resolution range 40.0-2.8 and 13 bins
./xdscc12 scaled_nonmerged.HKL -dmin 40.0 -dmax 2.8 -nbin 13 > XDSCC.LP

# analyse output file XDSCC.LP, produced by xdscc12
# and write to good.xdscc only filenames with
# deltaCC > 3.0
python xdscc.py XDSCC.LP 3.0
This prints output of the following kind:
1	056_03_02_01/XDS_ASCII.HKL	    -5.75
3	101_05_02_05/XDS_ASCII.HKL	    -0.86
--	--------------------------	     0.00
4	106_05_03_04/XDS_ASCII.HKL	     0.11
7	109_05_03_07/XDS_ASCII.HKL	     0.35
5	107_05_03_05/XDS_ASCII.HKL	     1.23
10	112_05_03_10/XDS_ASCII.HKL	     1.60
8	110_05_03_08/XDS_ASCII.HKL	     3.31
12	191_13_01_01/XDS_ASCII.HKL	     4.35
9	111_05_03_09/XDS_ASCII.HKL	     4.40
11	113_05_03_11/XDS_ASCII.HKL	     4.69
6	108_05_03_06/XDS_ASCII.HKL	     7.66
13	196_13_02_03/XDS_ASCII.HKL	     8.45
15	204_18_01_01/XDS_ASCII.HKL	    10.39
2	090_04_02_02/XDS_ASCII.HKL	    11.33
14	203_17_01_01/XDS_ASCII.HKL	    24.51
and writes the following good.xdscc:
110_05_03_08/XDS_ASCII.HKL
191_13_01_01/XDS_ASCII.HKL
111_05_03_09/XDS_ASCII.HKL
113_05_03_11/XDS_ASCII.HKL
108_05_03_06/XDS_ASCII.HKL
196_13_02_03/XDS_ASCII.HKL
204_18_01_01/XDS_ASCII.HKL
090_04_02_02/XDS_ASCII.HKL
203_17_01_01/XDS_ASCII.HKL
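
The cutoff step can be illustrated on the printed table itself. The sketch below is not xdscc.py (its source is not reproduced here); it only assumes the three-column layout shown above (index, file name, deltaCC12) and a hypothetical function name `select_good`.

```python
def select_good(table_text, cutoff):
    """Return file names whose deltaCC12 (third column) exceeds the cutoff."""
    good = []
    for line in table_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "--" separator row marking deltaCC12 = 0.
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        _, filename, delta = fields
        if float(delta) > cutoff:
            good.append(filename)
    return good


SAMPLE = """\
1\t056_03_02_01/XDS_ASCII.HKL\t    -5.75
--\t--------------------------\t     0.00
10\t112_05_03_10/XDS_ASCII.HKL\t     1.60
8\t110_05_03_08/XDS_ASCII.HKL\t     3.31
14\t203_17_01_01/XDS_ASCII.HKL\t    24.51
"""

for name in select_good(SAMPLE, 3.0):
    print(name)
```

With the cutoff of 3.0 used above, only the rows with deltaCC12 of 3.31 and 24.51 from this excerpt are kept.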
Integration-2

For the second integration, you usually update express.py so that the initial unit cell constants match those of your reference dataset, and extend the resolution range (if your reference data set suggests there is potential for it):

express.py

grep '^!UNIT_CELL_CONSTANTS' reference.HKL
>!UNIT_CELL_CONSTANTS= 59.22     45.66     86.77  90.000  91.275  90.000
grep SPACE_GROUP_NUMBER reference.HKL
>!SPACE_GROUP_NUMBER=   4
Modify express.py input parameters:
data_summary_table = 'fin.csv'
space_group         = 'SPACE_GROUP_NUMBER=  4 \n'
unit_cell_constants = 'UNIT_CELL_CONSTANTS=  59.22     45.66     86.77  90.000  91.275  90.000\n'
max_num_proc        = 'MAXIMUM_NUMBER_OF_PROCESSORS= 80  \n'
max_num_jobs        = 'MAXIMUM_NUMBER_OF_JOBS=       48  \n'
resolution_range    = 'INCLUDE_RESOLUTION_RANGE= 30 2.5 \n'
And run it:
python express.py |& tee log.express.py_$(date "+%Y_%m_%d_%H_%M") & disown
Merging

After the second run, the minisets should be integrated about as well as they can be, and you can start merging them.

reject.sh
You may add several cycles of deltaCC rejection in various resolution ranges -- e.g. perform low-resolution rejection first (to get rid of non-isomorphous data), and then high-resolution second (to improve your resolution):

# first cycle -- resolution range 30.0-10.0, 10 bins, 5.0 deltaCC cutoff
for i in `seq 1 1 5`; do
    # part of the code omitted
    ./xdscc12 scaled_nonmerged.HKL -dmin 30.0 -dmax 10.0 -nbin 10 > XDSCC.LP
    python xdscc.py XDSCC.LP 5.0 |& tee log.xdscc_"$i"
    # part of the code omitted
    ...
done

# second cycle -- resolution range 5.0-2.5, 23 bins, 1.0 deltaCC cutoff
for i in `seq 6 1 10`; do
    # part of the code omitted
    ./xdscc12 scaled_nonmerged.HKL -dmin 5.0 -dmax 2.5 -nbin 23 > XDSCC.LP
    python xdscc.py XDSCC.LP 1.0 |& tee log.xdscc_"$i"
    # part of the code omitted
    ...
done

REIDX
For some data sets, you may want to run reindexing (this is the case for C2_S_I4 dataset). To enable further deltaCC rejection, make sure you write your re-indexed datasets (run XSCALE with MERGE=FALSE) as XDS_ASCII.HKL in corresponding folder.

SFX

For a primer on SFX data processing, please refer to the original CrystFEL tutorial. Here, we discuss the high-level wrappers used during processing.
Deposited datasets are organized as follows:
C1_Zaf_P1
+-- raw_data
|   +-- r0126-cyslt1-zaf
|   +-- r0127-cyslt1-zaf
|   |   +-- cxilq5415-r0133-c00.cxi
+-- streams
|   +-- c1_zaf_p1_2019_04_29_10_27_19.stream
|   +-- c1_zaf_p1_2019_04_29_17_49_58.stream
+-- logs
+-- scratch
+-- initial.geom
+-- run_crystfel.sh
+-- analysis.sh
+-- c1-zaf.cell
+-- laststream -> streams/c1_zaf_p1_2019_04_29_17_49_58.stream

The folders scratch, logs and streams are required by the wrapper scripts (run_crystfel.sh writes to all three); please make sure they exist (run mkdir scratch; mkdir streams; mkdir logs beforehand). Note that laststream points to the most recent stream, for convenience in further analysis.
Preparation of input files


find
Locate all input files (either *.h5 or *.cxi; only *.cxi files are used in the publication) in your subfolders:

Usage:
find . -name '*.cxi' > cxi.lst

list_events
Convert the several-events-per-line input file cxi.lst to the one-event-per-line file cxi-events.lst (needs a proper geometry file, which is provided in each deposition):

Usage:
list_events -i cxi.lst -o cxi-events.lst -g refined.geom
Integration


run_crystfel.sh
Wrapper for the indexamajig routine that i) arranges all CrystFEL-related files into subfolders, ii) automatically assigns the date and time to each generated stream and its respective log file, iii) links the last created stream to the laststream link, and iv) shuffles the input file list, so that one can quickly and reliably check the indexing rate before indexing finishes.

# prefix for all *.stream files in streams folder
PROJECT_NAME="c1_zaf_p1"
# number of cores used for processing
NPROC="95"

# PEAK FINDING PARAMETERS (see `man indexamajig`)
SNR='4.5'
THRESHOLD='210'
HIGHRES='2.5'

LST='cxi-events.lst'
CELL='c1-zaf.cell'

shuf "$LST" > input.lst # your list must have events to enable this
GEOM="initial.geom"

ln -f -s "streams/"$PROJECT_NAME"_${time}.stream" laststream
indexamajig -i input.lst \
--temp-dir=scratch \
-o "streams/"$PROJECT_NAME"_${time}.stream" \
\
-g "$GEOM" \
--peaks=peakfinder8 \
-j "$NPROC" \
--min-snr="$SNR" \
--threshold="$THRESHOLD" \
--highres="$HIGHRES" \
 \
-p "$CELL" \
--check-peaks \
 \
--indexing=felix,dirax,asdf,mosflm,xds,taketwo |& tee logs/log.indexamajig_${time}
Usage:
# short run without logging off
bash run_crystfel.sh
# long background run
bash run_crystfel.sh & disown
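
The excerpt of run_crystfel.sh above uses `${time}` without showing how it is set. The sketch below shows one way the timestamped stream name and the laststream link could be produced; it is an assumption based on the stream names in the folder tree (for example c1_zaf_p1_2019_04_29_17_49_58.stream), and the function names are hypothetical.

```python
import os
import time


def stream_path(project_name, when=None):
    """Build streams/<project>_<YYYY_mm_dd_HH_MM_SS>.stream (format assumed)."""
    stamp = time.strftime("%Y_%m_%d_%H_%M_%S", when or time.localtime())
    return os.path.join("streams", "%s_%s.stream" % (project_name, stamp))


def point_laststream(target, link="laststream"):
    """Re-create the symlink so that it points at the newest stream."""
    if os.path.lexists(link):
        os.remove(link)
    os.symlink(target, link)


# Only prints the name; call point_laststream(...) to actually update the link.
print(stream_path("c1_zaf_p1"))
```

Creating the link before indexamajig starts, as the shell script does, means laststream can be inspected while the stream is still being written.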
Integration analysis & merging


analysis.sh
Wrapper for process_hkl, partialator, check_hkl and compare_hkl routines, which produces XSCALE.LP-like statistics table, counts images indexed with different indexers, produces command-line visible histogram of image resolution (for simple estimation of push-res parameter) and writes logs.

# Indexing analysis only:
./analysis.sh -i laststream
# Merging with process_hkl and analysis:
./analysis.sh -i laststream --dorate 0 -j 96 --cell c1-zaf.cell --pushres 1.8 -s '-1' --highres 2.53
# Merging with partialator and analysis:
./analysis.sh -i laststream --dorate 1 -j 96 --cell c1-zaf.cell --pushres 1.8 -s '-1' --highres 2.53 --iterations 1 --model unity
Analysis of the input stream provides a text histogram of the resolution estimated by CrystFEL and a summary of each indexer's success:
=================
Indexing details:
=================
.046 	 1986 	 asdf-nolatt-cell
.168 	 7224 	 dirax-nolatt-nocell
.389 	 16717 	 felix-latt-cell
.002 	 121 	 mosflm-latt-cell
.057 	 2470 	 taketwo-latt-cell
0 	 10 	 xds-latt-cell
=================
Indexing summary:
=================
Total number of images for processing:	 43417
Number of processed images:		 42907
Number of indexed:	 28528
Number of crystals:	 28900
Number of spots found:	 2244193
Image indexing rate:		 .66
Crystals percentage:	 .67
Average crystals per image:	 1.01
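
The ratios in this summary come from `bc` with a fixed `scale`, which truncates rather than rounds. A short sketch reproducing the numbers above (the helper name `truncated_ratio` is hypothetical):

```python
def truncated_ratio(numerator, denominator, digits=2):
    """Divide and truncate to a fixed number of decimals, like bc with scale=digits."""
    scale = 10 ** digits
    return (numerator * scale // denominator) / scale


n_images = 42907   # Number of processed images
n_indexed = 28528  # Number of indexed
n_crystals = 28900 # Number of crystals
n_felix = 16717    # images indexed by felix-latt-cell

print("Image indexing rate:", truncated_ratio(n_indexed, n_images))
print("Crystals percentage:", truncated_ratio(n_crystals, n_images))
print("Average crystals per image:", truncated_ratio(n_crystals, n_indexed))
print("felix fraction:", truncated_ratio(n_felix, n_images, digits=3))
```

Note that "Crystals percentage" is reported as a fraction, and "Average crystals per image" is computed per indexed image, as in analysis.sh.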
If merging was performed, the following XSCALE.LP-like table is written:
Center 1/nm  # refs Possible  Compl       Meas   Red   SNR    Std dev       Mean     d(A)    Min 1/nm   Max 1/nm	Rsplit/%	CC	CC*
     1.086     3036     3036 100.00     501807 165.3 10.75    6043.70    4365.55     9.21       0.333      1.838	8.02	0.9914373	0.9978478
     2.076     3028     3028 100.00     309303 102.1  7.94    3191.70    3226.50     4.82       1.838      2.313	12.09	0.9720119	0.9928783
     2.480     2992     2992 100.00     236952  79.2  4.85    2389.33    1890.02     4.03       2.313      2.647	19.67	0.9567785	0.9888943
     2.780     3037     3037 100.00     175450  57.8  2.74    1122.50     866.10     3.60       2.647      2.913	38.39	0.8656982	0.9633355
     3.025     3041     3041 100.00     176418  58.0  1.92     644.22     483.40     3.31       2.913      3.138	56.76	0.7393615	0.9220373
     3.236     3020     3020 100.00     156472  51.8  1.18     468.81     272.44     3.09       3.138      3.334	98.64	0.5750552	0.8545193
     3.422     3037     3037 100.00     132063  43.5  0.72     365.36     170.59     2.92       3.334      3.510	171.59	0.3274640	0.7024014
     3.590     3043     3043 100.00     128703  42.3  0.61     360.10     143.10     2.79       3.510      3.669	211.22	0.2679825	0.6501470
     3.743     3004     3004 100.00     116843  38.9  0.43     364.12     109.43     2.67       3.669      3.816	299.75	0.1926646	0.5684036
     3.884     3063     3063 100.00      94652  30.9  0.31     385.49      79.01     2.57       3.816      3.953	487.25	0.0681203	0.3571438
   -------------------------------------------------------------------------------------------------------------------------------------------------------
     2.143    30301    30301 100.00    2028663  67.0  3.14    2749.07    1159.24     4.67       0.333      3.953	28.23	0.9739452	0.9933784
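
The last two columns of this table are related: the CC* values are consistent with the standard relation CC* = sqrt(2 CC / (1 + CC)), and the d(A) column equals 10 divided by the shell centre in 1/nm. A small sketch checking both against rows of the table (function names are hypothetical):

```python
import math


def cc_star(cc_half):
    """Estimate CC* from the half-dataset correlation coefficient CC."""
    return math.sqrt(2.0 * cc_half / (1.0 + cc_half))


def d_spacing_angstrom(inverse_nm):
    """Convert a resolution given in 1/nm to a d-spacing in Angstrom."""
    return 10.0 / inverse_nm


# (CC, CC*) pairs copied from the first and last resolution shells above.
for cc, tabulated in [(0.9914373, 0.9978478), (0.0681203, 0.3571438)]:
    print("CC=%.7f  CC*=%.7f  tabulated=%.7f" % (cc, cc_star(cc), tabulated))

print("d = %.2f A at 1.086 1/nm" % d_spacing_angstrom(1.086))
```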

  
## analysis.sh
#!/bin/bash

dorate="-1"
symmetry='-1'
highres='3.0'
lowres='30.0'
iterations='0'
model='unity'
mincc="0.0"
scale="0";
pushres="inf"
j="6"

# looping over input parameters
cell_set=0
while [[ $# -gt 0 ]]
do
key="$1"
case $key in
    -i|--input)
    input="$2"
    shift # past argument
    ;;
    --dorate)
    dorate="$2"
    shift # past argument
    ;;
    -s|--symmetry)
    symmetry="$2"
    shift # past argument
    ;;
    --highres)
    highres="$2"
    shift # past argument
    ;;
    --lowres)
    lowres="$2"
    shift # past argument
    ;;
    --iterations)
    iterations="$2"
    shift # past argument
    ;;
    --mincc)
    mincc="$2"
    shift # past argument
    ;;
    -m|--model)
    model="$2"
    shift # past argument
    ;;
    --scale)
    scale="$2";
    shift # past argument
    ;;
    -p|--pushres)
    pushres="$2"
    shift # past argument
    ;;
    -h|--help)
    echo "Indexing analysis only:"
    echo "      ./analysis.sh -i output.stream"
    echo "Merging with process_hkl and analysis:"
    echo "      ./analysis.sh -i output.stream --dorate 0 --pushres 1.0 --highres 2.5 --lowres 30.0 --symmetry 222"
    echo "Merging with partialator and analysis:"
    echo "      ./analysis.sh -i output.stream --dorate 1 --pushres 1.0 --highres 2.5 --lowres 30.0 --symmetry 222"
	exit 0;
    ;;
    -j|--nproc)
    j="$2"
    shift # past argument
    ;;
    -c|--cell)
    cell="$2"
    cell_set=1
    shift # past argument
    ;;
    --default)
    DEFAULT=YES
    ;;
    *)
            # unknown option
    ;;
esac
shift # past argument or value
done

# if [[ "$scale" == "1" ]]; then
#     echo "YES";
# else
#     echo "NO";
# fi
# exit 0;


output=merging_stats_$(md5sum $input | cut -c1-5).csv


#----------------------------


# writes statistics obtained with check_hkl (SNR, multiplicity, number of reflections, etc.) together with Rsplit, CC and CC* to overall_stats.csv.
function rate {
	rm stats[0-9].dat &>/dev/null
        compare_hkl tmp.hkl1 tmp.hkl2 -y "$symmetry" -p "$cell" --fom rsplit --nshells=10 --lowres "$lowres" --highres "$highres" &> compare_hkl.log ; cat shells.dat >  stats1.dat
        compare_hkl tmp.hkl1 tmp.hkl2 -y "$symmetry" -p "$cell" --fom cc     --nshells=10 --lowres "$lowres" --highres "$highres" &> compare_hkl.log ; grep -a -v "shitcentre" shells.dat > stats2.dat
        compare_hkl tmp.hkl1 tmp.hkl2 -y "$symmetry" -p "$cell" --fom ccstar --nshells=10 --lowres "$lowres" --highres "$highres" &> compare_hkl.log ; grep -a -v "shitcentre" shells.dat > stats3.dat
        check_hkl tmp.hkl -y "$symmetry" -p "$cell"                                       --lowres="$lowres" --highres "$highres" &> compare_hkl.log ; cat shells.dat > stats4.dat

        compare_hkl tmp.hkl1 tmp.hkl2 -y "$symmetry" -p "$cell" --fom rsplit --nshells=1 --lowres "$lowres" --highres "$highres" &> compare_hkl.log ; cat shells.dat >  stats5.dat
        compare_hkl tmp.hkl1 tmp.hkl2 -y "$symmetry" -p "$cell" --fom cc     --nshells=1 --lowres "$lowres" --highres "$highres" &> compare_hkl.log ; grep -a -v "shitcentre" shells.dat > stats6.dat
        compare_hkl tmp.hkl1 tmp.hkl2 -y "$symmetry" -p "$cell" --fom ccstar --nshells=1 --lowres "$lowres" --highres "$highres" &> compare_hkl.log ; grep -a -v "shitcentre" shells.dat >> stats7.dat
        check_hkl tmp.hkl --nshells 1 -y "$symmetry" -p "$cell"                          --lowres "$lowres" --highres "$highres" &> compare_hkl.log ; cat shells.dat >> stats8.dat
	paste stats4.dat <(awk '{print $3}' stats1.dat) <(awk '{print $3}' stats2.dat) <(awk '{print $3}' stats3.dat) | head -1        > overall_stats.csv
	paste stats4.dat  <(awk '{print $2}' stats1.dat)  <(awk '{print $2}' stats2.dat)  <(awk '{print $2}' stats3.dat) | tail -n +2 >> overall_stats.csv

	echo "   -------------------------------------------------------------------------------------------------------------------------------------------------------" >> overall_stats.csv
	paste stats8.dat  <(awk '{print $2}' stats5.dat)  <(awk '{print $2}' stats6.dat)  <(awk '{print $2}' stats7.dat) | tail -n +2 >> overall_stats.csv
}


echo "Filename for current run: $input"
echo "Stream generated by:  $(grep -a 'Generated by' "$input" | uniq)"

pythonstring='from __future__ import print_function; print(*[i.split("-i")[1].split()[0] for i in open("'$input'").readlines() if "indexamajig" in i],sep="\n")'
NIMAGES_INPUT=$(python2 -c "$pythonstring" | xargs wc -l 2> /dev/null | tail -1 | awk '{print $1}')
if [[ "$NIMAGES_INPUT" -eq 0 ]]; then
	NIMAGES_INPUT="n/a (file lists not available)"
fi

#-----------------------

number_of_streams=$(grep -a 'indexamajig' $input | wc -l) # number of streams used for dorate processing
if [[ "$number_of_streams" -gt 1 ]]
then
	echo "Multi-stream mode; number of streams: $number_of_streams"
	echo "indexamajig string: $(grep -a 'indexamajig' $input | tail -1)"
else
	echo "Single-stream mode; number of streams: 1"
	echo "indexamajig string: $(grep -a indexamajig $input)"
fi


echo "md5 checksum: $(md5sum $input)"
echo "Date: $(date -R)"

echo "================="
echo "Indexing details:"
echo "================="

NIMAGES=$(grep -a "Begin chunk" $input | wc -l )
NCRYST=$(grep -a "Begin crystal" $input | wc -l )

# lists all indexing methods used
METHODS=($(egrep -a "indexed_by" "$input" | grep -a -v 'none' | sort | uniq | awk 'NF>1{print $NF}' | tr '\n' ' '))
NINDEXED=0

for i in "${METHODS[@]}"
do
	if [ $i = "none" ]
	then
		continue
	fi

	tmp="$(egrep -a -w "$i" "$input" | wc -l)"
	let "NINDEXED=$NINDEXED+$tmp"
	ratio=$(echo " scale=3; $tmp/$NIMAGES" | bc)
	echo -e $ratio "\t" $tmp "\t" "$i"
done

NSPOTS=$(grep -a "num_reflections" "$input" | awk '{print $3;}' | paste -sd+ | bc)


echo "================="
echo "Indexing summary:"
echo "================="
echo "Total number of images for processing:	" $NIMAGES_INPUT
echo "Number of processed images:		" $NIMAGES
echo "Number of indexed:	" $NINDEXED
echo "Number of crystals:	" $NCRYST
echo "Number of spots found:	" $NSPOTS
#echo "Spots per image:	" $(echo "scale=2; $NSPOTS/$NIMAGES" | bc )
#echo "Spots per crystal:	" $(echo "scale=2; $NSPOTS/$NCRYST" | bc )
echo "Image indexing rate:		" $(echo "scale=2; $NINDEXED/$NIMAGES" | bc )
echo "Crystals percentage:	" $(echo "scale=2; $NCRYST/$NIMAGES" | bc)
echo "Average crystals per image:	" $(echo "scale=2; $NCRYST/$NINDEXED" | bc)


echo "==================="
echo "Resolution summary:"
echo "==================="
grep 'diffraction_resolution_limit' $input | awk '{print $6}' | sort -n > reslim.txt
python2 -c 'from text_histogram import histogram; histogram([float(elem) for elem in open("reslim.txt").read().split("\n") if elem and float(elem) < 10], buckets=15)'

echo "======================="
echo "Profile radius summary:"
echo "======================="
grep 'profile_radius' $input | awk '{print $3}' | sort -n > profile_radius.txt
python2 -c 'from text_histogram import histogram; histogram([float(elem) for elem in open("profile_radius.txt").read().split("\n") if elem], buckets=15)'


if [[ "$dorate" == "1" ]]; then
	# runs partialator to estimate rmeas and other foms
	partialator -i "$input" -o tmp.hkl --iterations "$iterations" -j "$j" --model "$model"  --push-res "$pushres" -y "$symmetry"  &> partialator.log
	rate
elif [[ "$dorate" == "0" ]]; then
	if [[ "$scale" == "1" ]]; then
    		process_hkl -i "$input" --min-cc "$mincc" --scale -o tmp.hkl  -y "$symmetry" --min-res "$lowres" --push-res "$pushres"
    		process_hkl -i "$input" --min-cc "$mincc" --scale -o tmp.hkl1 -y "$symmetry" --min-res "$lowres" --push-res "$pushres" --odd-only
    		process_hkl -i "$input" --min-cc "$mincc" --scale -o tmp.hkl2 -y "$symmetry" --min-res "$lowres" --push-res "$pushres" --even-only
	else
    		process_hkl -i "$input" --min-cc "$mincc" -o tmp.hkl  -y "$symmetry" --min-res "$lowres" --push-res "$pushres"
    		process_hkl -i "$input" --min-cc "$mincc" -o tmp.hkl1 -y "$symmetry" --min-res "$lowres" --push-res "$pushres" --odd-only
    		process_hkl -i "$input" --min-cc "$mincc" -o tmp.hkl2 -y "$symmetry" --min-res "$lowres" --push-res "$pushres" --even-only
	fi
    rate
else
	:
fi


if [[ "$dorate" == "-1" ]]; then
	# rate
	exit 0; fi

echo "================"
echo "Merging summary:"
echo "================"
echo "Merging stats backup file: $output"
tail tmp.hkl | tail -n 5 | head -n 1


echo "================" >>  "$output"
echo "Merging summary:" >> "$output"
echo "================" >> "$output"
tail tmp.hkl | tail -n 5 | head -n 1 >> "$output"
cat overall_stats.csv >> "$output"


rm stats[0-9].dat
cat overall_stats.csv


## create_express_inp.py
#!/usr/bin/env python

from __future__ import print_function
import os

for elem in os.listdir("."):
    if "merging" in elem:
        continue
    if not os.path.isfile(elem + "/XDS.INP"):
        continue
    if os.path.isdir(elem):
        print(
            elem,
            "/".join([os.getcwd(), elem]),
            "/".join([os.getcwd(), elem, "/images"]),
            1,
            len(os.listdir("/".join([os.getcwd(), elem, "/images"]))) - 2,
            sep=",",
        )

## cxidb_id106.txt
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/analysis.sh
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/c1.lst
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/c1.pdb
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/c1_events.lst
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/c1_p21_2019_04_09_11_40_00.stream
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/c1_v1.pdb
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/check-near-bragg
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/initial-predrefine.geom
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/initial.geom
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/initial_v1.geom
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/initial_v2.geom
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/input.lst
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/overall_stats.csv.backup
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/runcrystfel.sh
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/streams.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/tmp.hkl
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/tmp.hkl1
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/tmp.hkl2
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/tmp.lst
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0012-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0013-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0014-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0015-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0016-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0017-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0018-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0019-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0020-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0058-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0059-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0060-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0061-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0062-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0063-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0064-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0065-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0066-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0067-cyslt1-nh4.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/106/C1_Zaf_P21/hdf5/r0119-lys3.tar
https://cxidb.org/data/106/C1_Zaf_P21_stream.tar.gz

## cxidb_id107.txt
https://www.cxidb.org/data/107/6RZ5_CysLT1R_stream.tar.gz
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/raw_data/r0127-cyslt1-zaf.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/raw_data/r0128-cyslt1-zaf.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/raw_data/r0129-cyslt1-zaf.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/raw_data/r0130-cyslt1-zaf.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/raw_data/r0131-cyslt1-zaf.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/raw_data/r0133-cyslt1-zaf.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/raw_data/r0180-cyslt1-zaf.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/raw_data/r0181-cyslt1-zaf.tar
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/analysis.sh
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/c1-zaf.cell
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/cxi-events.lst
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/cxi.lst
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/index.html
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/input.lst
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/overall_stats.csv
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/raw_files.md5
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/raw_files_list.txt
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/refined.geom
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/runcrystfel.sh
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/streams.tar.gz
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/tmp.hkl
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/tmp.hkl1
http://portal.nersc.gov/archive/home/projects/cxidb/www/107/6RZ5_C1_Zaf_P1/tmp.hkl2

## download_all.sh
#!/bin/bash

download(){
    local DATASET="$1"
    local file_list="$2"

    mkdir -p "$DATASET"
    while read -r url; do
        (cd "$DATASET" && curl -O "$url")
    done < "$file_list"
}

DATASETS=("cxidb_ID106_C1_Zaf_P21" "cxidb_ID107_C1_Zaf_P1" "zenodo_6RZ4_CysLT1R" "zenodo_6RZ6_CysLT2R" "zenodo_6RZ7_CysLT2R" "zenodo_6RZ8_CysLT2R" "zenodo_6RZ9_CysLT2R")
LISTS=("cxidb_id106.txt" "cxidb_id107.txt" "zenodo_cyslt1r_6RZ4.txt" "zenodo_cyslt2r_6RZ6.txt" "zenodo_cyslt2r_6RZ7.txt" "zenodo_cyslt2r_6RZ8.txt" "zenodo_cyslt2r_6RZ9.txt")

for i in "${!DATASETS[@]}"; do
    echo "Downloading: ${DATASETS[$i]} ${LISTS[$i]}"
    download "${DATASETS[$i]}" "${LISTS[$i]}"
done |& tee download.log

## express.py
#!/usr/bin/python
from __future__ import print_function
import os
import re
from shutil import copyfile


# looks for pattern in a string; returns a match object (truthy) if found, None if not
def Find(pat, text):
    match = re.search(pat, text)
    return match


data_summary_table = "fin.csv"
working_directory = os.getcwd()  # assumes that you are already in processing folder


space_group = "!SPACE_GROUP_NUMBER=  4 \n"
unit_cell_constants = (
    "!UNIT_CELL_CONSTANTS=  59.22     45.66     86.77  90.000  91.275  90.000\n"
)
max_num_proc = "MAXIMUM_NUMBER_OF_PROCESSORS= 80  \n"
max_num_jobs = "MAXIMUM_NUMBER_OF_JOBS=       48  \n"
resolution_range = "INCLUDE_RESOLUTION_RANGE= 40 2.0 \n"

# reference_data_set = ''
reference_data_set = "REFERENCE_DATA_SET= %s \n" % "../reference.HKL"
if len(reference_data_set) > 0:
    use_reference = True
else:
    use_reference = False


fin = open(data_summary_table).read().split("\n")
log = open("log.express", "w")

# reading input files into XDSs
XDSs = dict()
for index, string in enumerate(fin):
    try:
        name, data, inp, data_range_start, data_range_stop = string.split(",")
    except ValueError:
        print("Error while loading string %d:\t%s" % (index, string), file=log)
        continue
    XDSs[name] = [data, inp, data_range_start, data_range_stop]

print("Following folders detected:\n", file=log)
for name in XDSs.keys():
    print("%s\n \t%s\n \t%s\n\n" % (name, XDSs[name][0], XDSs[name][1]), file=log)


# process each dataset folder: patch its XDS.INP with the settings above
for name in XDSs.keys():
    os.chdir(working_directory)

    xycorr = XDSs[name][1]
    xds = XDSs[name][1] + "/XDS.INP"
    data = XDSs[name][0]
    data_range_start = XDSs[name][2]
    data_range_stop = XDSs[name][3]

    os.chdir(name)
    project_dir = os.getcwd()
    # remember current directory

    copyfile("../XSCALE.express.py.INP", "XSCALE.INP")

    xds = open("XDS.INP", "r")
    modif = xds.readlines()
    xds.close()

    job = False
    spotrange_first = False
    for i, string in enumerate(modif):
        noreference = True

        if False:
            pass

        elif Find("SPACE_GROUP_NUMBER=", string):
            modif[i] = space_group
            print("### Space group added")

        elif Find("UNIT_CELL_CONSTANTS=", string):
            # modif[i] = 'UNIT_CELL_CONSTANTS= 36.337 35.631 41.277 90.000 93.606 90.000\n'
            modif[i] = unit_cell_constants
            print("### Unit cell constants added")

        elif Find("JOB=", string):
            # keep a single JOB= line (first pass stops after indexing)
            # and blank out any further JOB= lines
            if job:
                modif[i] = "\n"
            else:
                modif[i] = "JOB=XYCORR INIT COLSPOT IDXREF\n"
                job = True
            print("### Job added")

        elif Find("SECONDS=", string):
            modif[i] = "!" + string
            print("### Seconds commented out")

        elif Find("MAXIMUM_NUMBER_OF_PROCESSORS=", string):
            modif[i] = max_num_proc
            print("### Maximum number of processors added")

        elif Find("MAXIMUM_NUMBER_OF_JOBS=", string):
            modif[i] = max_num_jobs
            print("### Maximum number of jobs added")

        elif Find("X-GEO_CORR=", string):
            # unpack the correction table stored next to the images, then point XDS at it
            os.system("bzip2 -d %s" % (xycorr + "/x_geo_corr.cbf.bz2"))
            modif[i] = "X-GEO_CORR= images/x_geo_corr.cbf\n"

        elif Find("Y-GEO_CORR=", string):
            os.system("bzip2 -d %s" % (xycorr + "/y_geo_corr.cbf.bz2"))
            modif[i] = "Y-GEO_CORR= images/y_geo_corr.cbf\n"

        elif Find("RESOLUTION_RANGE", string):
            modif[i] = resolution_range
            print("### Resolution range added")

        elif Find("LIB", string):
            modif[
                i
            ] = "LIB=/home/marin/Apps/neggia/build/src/dectris/neggia/plugin/dectris-neggia.so \n"

        elif Find("REFERENCE_DATA_SET", string):
            modif[i] = ""
            if use_reference:
                modif[i] = reference_data_set
                noreference = False
            else:
                pass

        elif Find("SPOT_RANGE", string):
            modif[i] = "SPOT_RANGE= %s %s\n" % (data_range_start, data_range_stop)

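    # append the reference only if XDS.INP does not already contain one
    noreference = not any(Find("REFERENCE_DATA_SET", line) for line in modif)
    if use_reference and noreference:
        modif.append(reference_data_set)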

    with open("XDS.INP", "w") as xds:
        xds.writelines(modif)

    os.system("xds_par")
    os.system("xscale_par")
    os.system("cp XDS_ASCII.HKL XDS_ASCII.HKL_old")

    os.system("cp GXPARM.XDS XPARM.XDS")
    os.system("mv CORRECT.LP CORRECT.LP.old")
    os.system("mv XSCALE.LP XSCALE.LP.old")
    os.system("egrep -v 'JOB|REIDX' XDS.INP > XDS.INP.new")
    os.system(
        'echo "! JOB=XYCORR INIT COLSPOT IDXREF DEFPIX INTEGRATE CORRECT" > XDS.INP'
    )
    os.system('echo "JOB=DEFPIX INTEGRATE CORRECT" >> XDS.INP')
    os.system("cat XDS.INP.new >> XDS.INP")
    os.system("xds_par")
    os.system("xscale_par")


os.system("ls -alt")
log.close()
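
The keyword patching that express.py applies to XDS.INP can be sketched as a small function. This is a minimal illustration rather than the script itself: `rewrite_keywords` and the `overrides` mapping are hypothetical names, and a keyword is assumed to match only when it starts a line (optionally commented out with `!`), which is stricter than the `re.search` used above.

```python
def rewrite_keywords(lines, overrides):
    """Return a copy of XDS.INP lines with selected keywords replaced.

    lines     -- list of strings, one per XDS.INP line (newline included)
    overrides -- dict mapping a keyword such as "SPOT_RANGE=" to its new line
    Keywords that never occur in `lines` are appended at the end.
    """
    seen = set()
    result = []
    for line in lines:
        stripped = line.lstrip("! \t")  # also match commented-out keywords
        for keyword, new_line in overrides.items():
            if stripped.startswith(keyword):
                result.append(new_line)
                seen.add(keyword)
                break
        else:
            result.append(line)  # line is left untouched
    for keyword, new_line in overrides.items():
        if keyword not in seen:
            result.append(new_line)
    return result


# toy input, not taken from a deposited XDS.INP
example = [
    "JOB= ALL\n",
    "!SPACE_GROUP_NUMBER= 0\n",
    "DATA_RANGE= 1 100\n",
]
overrides = {
    "JOB=": "JOB=XYCORR INIT COLSPOT IDXREF\n",
    "SPOT_RANGE=": "SPOT_RANGE= 1 100\n",
}
new_lines = rewrite_keywords(example, overrides)
print("".join(new_lines), end="")
```

Keeping the rewrite free of file access and `os.system` calls makes it possible to check the resulting XDS.INP text before any XDS job is started.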

## reject.sh
#!/bin/bash

cp XSCALE.INP XSCALE.INP.reject_backup
ls */XDS_ASCII.HKL > xscale.inp
sed -e 's/^/INPUT_FILE= /g' xscale.inp | sed -e 's/$/\nINCLUDE_RESOLUTION_RANGE= 30 2.5/' > XSCALE.INP.reject_0
echo "OUTPUT_FILE= scaled_nonmerged.HKL" > XSCALE.INP
echo "MAXIMUM_NUMBER_OF_PROCESSORS= 80" >> XSCALE.INP
echo "MERGE= FALSE" >> XSCALE.INP
echo "!REFERENCE_DATA_SET= reference.HKL" >> XSCALE.INP
echo "" >> XSCALE.INP
cat XSCALE.INP.reject_0 >> XSCALE.INP

cp XSCALE.INP XSCALE.INP.reject_0

for i in `seq 1 1 5`; do
	xscale_par
	cp XSCALE.LP{,_$i}
	./xdscc12 scaled_nonmerged.HKL -dmin 30.0 -dmax 10.0 -nbin 7 > XDSCC.LP
	python xdscc.py XDSCC.LP 1.0 |& tee log.xdscc_"$i"
	cp good.xdscc xscale.inp_"$i"
	sed -e 's/^/INPUT_FILE= /g' xscale.inp_"$i" | sed -e 's/$/\nINCLUDE_RESOLUTION_RANGE= 30 2.5/' > XSCALE.INP.reject_"$i"
	echo "OUTPUT_FILE= scaled_nonmerged.HKL" > XSCALE.INP
	echo "MAXIMUM_NUMBER_OF_PROCESSORS= 80" >> XSCALE.INP
	echo "MERGE= FALSE" >> XSCALE.INP
	echo "!REFERENCE_DATA_SET= reference.HKL" >> XSCALE.INP
	echo "" >> XSCALE.INP
	cat XSCALE.INP.reject_"$i" >> XSCALE.INP
	cp scaled_nonmerged.HKL reference.HKL
done

for i in `seq 6 1 9`; do
	xscale_par
	cp XSCALE.LP{,_$i}
	./xdscc12 scaled_nonmerged.HKL -dmin 10.0 -dmax 2.5 -nbin 15 > XDSCC.LP
	python xdscc.py XDSCC.LP 1.0 |& tee log.xdscc_"$i"
	cp good.xdscc xscale.inp_"$i"
	sed -e 's/^/INPUT_FILE= /g' xscale.inp_"$i" | sed -e 's/$/\nINCLUDE_RESOLUTION_RANGE= 30 2.5/' > XSCALE.INP.reject_"$i"
	echo "OUTPUT_FILE= scaled_nonmerged.HKL" > XSCALE.INP
	echo "MAXIMUM_NUMBER_OF_PROCESSORS= 80" >> XSCALE.INP
	echo "MERGE= FALSE" >> XSCALE.INP
	echo "!REFERENCE_DATA_SET= reference.HKL" >> XSCALE.INP
	echo "" >> XSCALE.INP
	cat XSCALE.INP.reject_"$i" >> XSCALE.INP
	cp scaled_nonmerged.HKL reference.HKL
done
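
The XSCALE.INP layout that reject.sh assembles with `echo` and `sed` can also be written as one Python function. This is only a sketch under stated assumptions: `build_xscale_inp` and its arguments are hypothetical names, the default values are copied from the script above, and nothing here runs `xscale_par`.

```python
def build_xscale_inp(hkl_files, low_res=30.0, high_res=2.5,
                     output="scaled_nonmerged.HKL", nproc=None):
    """Assemble XSCALE.INP text: header lines, then one block per input file."""
    lines = ["OUTPUT_FILE= %s" % output, "MERGE= FALSE"]
    if nproc is not None:
        lines.append("MAXIMUM_NUMBER_OF_PROCESSORS= %d" % nproc)
    lines.append("")
    for path in hkl_files:
        # every input file gets its own resolution range, as in reject.sh
        lines.append("INPUT_FILE= %s" % path)
        lines.append("INCLUDE_RESOLUTION_RANGE= %g %g" % (low_res, high_res))
    return "\n".join(lines) + "\n"


# folder names follow the XXX_YY_ZZ pattern of the deposited data sets
text = build_xscale_inp(["001_01_01/XDS_ASCII.HKL", "002_01_02/XDS_ASCII.HKL"])
print(text, end="")
```

Writing the header first and the per-file blocks afterwards avoids the situation where a later `>` redirection truncates lines that were appended earlier.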


## runcrystfel.sh
#!/bin/bash

time=$(date "+%Y_%m_%d_%H_%M_%S")

PROJECT_NAME="protein"
NPROC=`nproc`

# PEAK FINDING PARAMETERS
SNR='4.4'
THRESHOLD='20'
HIGHRES='3.0'

LST='c1_events.lst'
CELL='c1_v1.pdb'

#shuf "$LST" | head -n 1000 > input.lst # your list must have events to enable this
shuf "$LST" > input.lst # your list must have events to enable this

GEOM="initial_v1.geom"


mkdir -p streams logs scratch # make sure the folders used below exist
ln -f -s "streams/${PROJECT_NAME}_${time}.stream" laststream
indexamajig -i input.lst \
--temp-dir=scratch \
-o "streams/${PROJECT_NAME}_${time}.stream" \
\
-g "$GEOM" \
--peaks=peakfinder8 \
-j "$NPROC" \
--min-snr="$SNR" \
--threshold="$THRESHOLD" \
--highres="$HIGHRES" \
--max-res=300 \
--min-res=80 \
 \
-p "$CELL" \
--check-peaks \
 \
 --multi \
--indexing=dirax,xds,asdf,taketwo,xgandalf |& tee logs/log.indexamajig_${time}


## xdscc.py
#!/usr/bin/env python

from __future__ import print_function
import os
import sys
import re

# USAGE EXAMPLE:
#  xscale_par; grep 'Nano' -A 25 XSCALE.LP; xdscc12 scaled_nonmerged.HKL -dmin 5.0 -dmax 2.5 -nbin 10 > XDSCC.LP; python xdscc.py XDSCC.LP

# XDSCC.LP is opened twice: once for the statistics, once for the reflection file name
fin = open(sys.argv[1])
fin_re = open(sys.argv[1])
try:
    cutoff = float(sys.argv[2])
except IndexError:
    cutoff = 1.0


def fill(string, N=15):
    "Pads string with spaces up to length N"
    if len(string) > N:
        return string[:N]
    else:
        return string + " " * (N - len(string))


def get_rejected_crystals(xdscclp, rejection_func=None, mode="noano"):
    """
        Parse an XDSCC12.LP file and return the set of bad crystals, numbered
        as in the initial xdscclp file. The 'rejection_func' argument is
        currently unused: a crystal is rejected when its average CC is negative
        and its average Nref exceeds 10.
        """

    fin = open(xdscclp)
    bad_crystals = set()
    if mode == "noano":
        a = re.compile(r"^a\s+")
        b = re.compile(r"^b\s+")
        c = re.compile(r"^c\s+")
    elif mode == "ano":
        a = re.compile(r"^d\s+")
        b = re.compile(r"^e\s+")
        c = re.compile(r"^f\s+")
    else:
        print("Wrong mode given to get_rejected_crystals: %s" % mode)
        sys.exit(1)

    while True:
        fline = fin.readline()
        if not fline:
            break

        if a.match(fline):
            crystal_number = int(fline.split()[1])
            if crystal_number % 100 == 0:
                print("Working with crystal number %d" % crystal_number, end="\r")

        elif b.match(fline):
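            try:
                CC = [float(i) for i in fline.split()[1:]]
            except ValueError:
                # a "-100" value can be fused with the preceding number; separate
                # it and parse again. The original helper 'resolving' is not part
                # of this gist, so plain float parsing is assumed here.
                CC = [float(i) for i in fline.replace("-100", " -100").split()[1:]]
                CC = [i for i in CC if i != 0.0]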

        elif c.match(fline):
            Nref = [int(i) for i in fline.split()[1:]]

            CCaverage = sum([i for i in CC]) / len(CC)
            if CCaverage < 0 and sum(Nref) / len(Nref) > 10:
                bad_crystals.add(crystal_number)

    # returns set()
    return bad_crystals


# expressions to parse HKL file
dataset = lambda string: " ISET=" in string and "INPUT_FILE" in string
reflection_file = lambda string: "reflection file is" in string

getnamesfrom = [i for i in fin_re.readlines() if reflection_file(i)][0].split()[-1]
getnamesfrom = open(getnamesfrom)
fin_re.close()

datasets_from_xscale = dict()
i = 1

for fline in getnamesfrom.readlines():
    if dataset(fline):
        # print(fline,end='')
        datasets_from_xscale[i] = {"name": fline.split("INPUT_FILE=")[-1][:-1]}
        i += 1

getnamesfrom.close()


# expressions to parse XDSCC.LP

resolution_shells = lambda string: "resolution shells (for lines starting" in string
abcdef = re.compile(r"^[abcdef]\s+", re.M)

next_shells = False
for fline in fin.readlines():
    if resolution_shells(fline):
        next_shells = True
        continue
    elif next_shells:
        shells = [float(i) for i in fline.split() if i]
        next_shells = False

    if "overall" in fline:
        j = 0
    elif abcdef.match(fline):
        current_type, numbers = fline.split()[0], [float(i) for i in fline.split()[1:]]
        if current_type == "a" or current_type == "d":
            j += 1
        datasets_from_xscale[j][current_type] = numbers


iterxds = True
try:
    fin = open("iterxds.log")
except IOError:
    iterxds = False

if iterxds:
    i = 1
    for fline in fin.readlines():
        if "overall" in fline and len(fline.split()) > 3:
            rmeas_overall_low = fline.replace("%", "").split()[0]
        elif "XDS_ASCII" in fline:
            name = fline.split()[-1]
            rmeas_low, rmeas_overall = fline.replace("%", "").split()[:2]
            rmeas_low = float(rmeas_low)
            rmeas_overall = float(rmeas_overall)
            for key in datasets_from_xscale.keys():
                if datasets_from_xscale[key]["name"] == name:
                    datasets_from_xscale[key]["rmeas_low"] = rmeas_low
                    datasets_from_xscale[key]["rmeas_overall"] = rmeas_overall


fout = open("good.xdscc", "w")

padding_length = max([len(elem["name"]) for elem in datasets_from_xscale.values()])
toprint = ["--\t%s\t %8.2f" % ("-" * padding_length, 0)]
for key in datasets_from_xscale.keys():
    name = datasets_from_xscale[key]["name"]
    CCnoano = datasets_from_xscale[key]["b"]
    Nrefsnoano = datasets_from_xscale[key]["c"]
    # CCano = datasets_from_xscale[key]['e']
    # Nrefsano = datasets_from_xscale[key]['f']
    if iterxds:
        rmeas_low = datasets_from_xscale[key]["rmeas_low"]
        rmeas_overall = datasets_from_xscale[key]["rmeas_overall"]
        toprint.append(
            "%d\t%s\t %8.2f\t%2.2f\t%2.2f"
            % (
                key,
                fill(name, N=padding_length),
                sum(CCnoano) / len(CCnoano),
                rmeas_low,
                rmeas_overall,
            )
        )
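    else:
        toprint.append(
            "%d\t%s\t %8.2f"
            % (key, fill(name, N=padding_length), sum(CCnoano) / len(CCnoano))
        )
    # keep the data set for the next round if its average CC passes the cutoff,
    # whether or not iterxds.log is present
    if sum(CCnoano) / len(CCnoano) >= cutoff:
        print("%s" % name, file=fout)

fout.close()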


print(*sorted(toprint, key=lambda f: float(f.split()[2])), sep="\n")
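
The selection step at the end of xdscc.py (average the per-shell CC values of each data set and keep those at or above the cutoff) can be sketched in isolation. This is a minimal sketch, not the script's own code: `select_datasets` is a hypothetical name and the numbers are made up purely to show the behaviour.

```python
def select_datasets(cc_by_name, cutoff=1.0):
    """Split data sets into kept and rejected by their mean CC value.

    cc_by_name -- dict: data set name -> list of per-shell CC values
    Returns (kept, rejected) as lists of names, sorted by mean CC.
    Data sets without any CC values are skipped.
    """
    means = {name: sum(cc) / len(cc) for name, cc in cc_by_name.items() if cc}
    ranked = sorted(means, key=means.get)
    kept = [name for name in ranked if means[name] >= cutoff]
    rejected = [name for name in ranked if means[name] < cutoff]
    return kept, rejected


# made-up values, only for illustration
toy = {
    "001_01_01/XDS_ASCII.HKL": [5.0, 3.0, 1.0],   # mean 3.0
    "002_01_02/XDS_ASCII.HKL": [-2.0, 0.5, 0.0],  # mean -0.5
    "003_02_01/XDS_ASCII.HKL": [1.0, 1.0, 1.0],   # mean 1.0
}
kept, rejected = select_datasets(toy, cutoff=1.0)
print("kept:", kept)
print("rejected:", rejected)
```

In the workflow above, the kept names are what ends up in good.xdscc and is turned into `INPUT_FILE=` lines for the next XSCALE round.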

## zenodo_cyslt1r_6RZ4.txt
https://zenodo.org/record/3921911/files/6RZ4_C1_Pran.tar.gz?download=1
https://zenodo.org/record/3921911/files/6RZ4_C1_Pran_hkls.tar.gz?download=1

## zenodo_cyslt2r_6RZ6.txt
https://zenodo.org/record/3842753/files/6RZ6_C2_L_C2221_hkls.tar.gz?download=1
https://zenodo.org/record/3842753/files/6RZ6_C2_L_C2221.tar.gz?download=1

## zenodo_cyslt2r_6RZ7.txt
https://zenodo.org/record/3921930/files/6RZ7_C2_L_F222_hkls.tar.gz?download=1
https://zenodo.org/record/3921930/files/6RZ7_C2_L_F222.tar.gz?download=1

## zenodo_cyslt2r_6RZ8.txt
https://zenodo.org/record/3921931/files/6RZ8_C2_S_I4_hkls.tar.gz?download=1
https://zenodo.org/record/3921931/files/6RZ8_C2_S_I4.tar.gz?download=1

## zenodo_cyslt2r_6RZ9.txt
https://zenodo.org/record/3921934/files/6RZ9_C2_O_C2221_hkls.tar.gz?download=1
https://zenodo.org/record/3921934/files/6RZ9_C2_O_C2221.tar.gz?download=1
	#!/usr/bin/python
	from __future__ import print_function
	import os
	import re
	from shutil import copyfile


	# looks for pattern in a string; returns True if found, False if not
	def Find(pat, text):
	match = re.search(pat, text)
	return match


	data_summary_table = "fin.csv"
	working_directory = os.getcwd() # assumes that you are already in processing folder


	space_group = "!SPACE_GROUP_NUMBER= 4 \n"
	unit_cell_constants = (
	"!UNIT_CELL_CONSTANTS= 59.22 45.66 86.77 90.000 91.275 90.000\n"
	)
	max_num_proc = "MAXIMUM_NUMBER_OF_PROCESSORS= 80 \n"
	max_num_jobs = "MAXIMUM_NUMBER_OF_JOBS= 48 \n"
	resolution_range = "INCLUDE_RESOLUTION_RANGE= 40 2.0 \n"

	# reference_data_set = ''
	reference_data_set = "REFERENCE_DATA_SET= %s \n" % "../reference.HKL"
	if len(reference_data_set) > 0:
	use_reference = True
	else:
	use_reference = False


	fin = open(data_summary_table).read().split("\n")
	log = open("log.express", "w")

	# reading input files into XDSs
	XDSs = dict()
	for index, string in enumerate(fin):
	try:
	name, data, inp, data_range_start, data_range_stop = string.split(",")
	except ValueError:
	print("Error while loading string %d:\t%s" % (index, string), file=log)
	continue
	XDSs[name] = [data, inp, data_range_start, data_range_stop]

	print("Following folders detected:\n", file=log)
	for name in XDSs.keys():
	print("%s\n \t%s\n \t%s\n\n" % (name, XDSs[name][0], XDSs[name][1]), file=log)


	# for each dataset folder does some stuff
	for name in XDSs.keys():
	os.chdir(working_directory)

	xycorr = XDSs[name][1]
	xds = XDSs[name][1] + "/XDS.INP"
	data = XDSs[name][0]
	data_range_start = XDSs[name][2]
	data_range_stop = XDSs[name][3]

	os.chdir(name)
	project_dir = os.getcwd()
	# remember current directory

	copyfile("../XSCALE.express.py.INP", "XSCALE.INP")

	xds = open("XDS.INP", "r")
	modif = xds.readlines()
	xds.close()

	job = False
	spotrange_first = False
	for i, string in enumerate(modif):
	noreference = True

	if False:
	pass

	elif Find("SPACE_GROUP_NUMBER=", string):
	modif[i] = space_group
	print("### Space group added")

	elif Find("UNIT_CELL_CONSTANTS=", string):
	# modif[i] = 'UNIT_CELL_CONSTANTS= 36.337 35.631 41.277 90.000 93.606 90.000\n'
	modif[i] = unit_cell_constants
	print("### Unit cell constants added")

	elif Find("JOB=", string):
	if job:
	modif[i] = "\n"
	else:
	modif[i] = "JOB=XYCORR INIT COLSPOT IDXREF\n"
	job = True
	print("### Job added")

	elif Find("JOB=", string) and job == True:
	modif[i] = "\n"

	elif Find("SECONDS=", string):
	modif[i] = "!" + string
	print("### Seconds added")

	elif Find("MAXIMUM_NUMBER_OF_PROCESSORS=", string):
	modif[i] = max_num_proc
	print("### Maximum number of processors added")

	elif Find("MAXIMUM_NUMBER_OF_JOBS=", string):
	modif[i] = max_num_jobs
	print("### Maximum number of jobs added")

	elif Find("X-GEO_CORR=", string):
	os.system("bzip2 -d %s" % (xycorr + "/x_geo_corr.cbf.bz2\n"))
	# modif[i] = 'X-GEO_CORR=%s'%(xycorr + '/x_geo_corr.cbf\n')
	modif[i] = "images/x_geo_corr.cbf\n"
	#
	elif Find("Y-GEO_CORR=", string):
	os.system("bzip2 -d %s" % (xycorr + "/y_geo_corr.cbf.bz2\n"))
	# modif[i] = 'Y-GEO_CORR=%s'%(xycorr + '/y_geo_corr.cbf\n')
	modif[i] = "images/y_geo_corr.cbf\n"

	elif Find("RESOLUTION_RANGE", string):
	modif[i] = resolution_range
	print("### Resolution range added")

	elif Find("LIB", string):
	modif[
	i
	] = "LIB=/home/marin/Apps/neggia/build/src/dectris/neggia/plugin/dectris-neggia.so \n"

	elif Find("REFERENCE_DATA_SET", string):
	modif[i] = ""
	if use_reference:
	modif[i] = reference_data_set
	noreference = False
	else:
	pass

	elif Find("SPOT_RANGE", string):
	modif[i] = "SPOT_RANGE= %s %s\n" % (data_range_start, data_range_stop)

	noreference = True
	if noreference:
	modif.append("REFERENCE_DATA_SET= ../reference.HKL\n")

	with open("XDS.INP", "w") as xds:
	xds.writelines(modif)

	xds.close()

	os.system("xds_par")
	os.system("xscale_par")
	os.system("cp XDS_ASCII.HKL XDS_ASCII.HKL_old")

	os.system("cp GXPARM.XDS XPARM.XDS")
	os.system("mv CORRECT.LP CORRECT.LP.old")
	os.system("mv XSCALE.LP XSCALE.LP.old")
	os.system("egrep -v 'JOB\|REIDX' XDS.INP > XDS.INP.new")
	os.system(
	'echo "! JOB=XYCORR INIT COLSPOT IDXREF DEFPIX INTEGRATE CORRECT" > XDS.INP'
	)
	os.system('echo "JOB=DEFPIX INTEGRATE CORRECT" >> XDS.INP')
	os.system("cat XDS.INP.new >> XDS.INP")
	os.system("xds_par")
	os.system("xscale_par")


	os.system("ls -alt")
	log.close()
	#!/bin/bash

	cp XSCALE.INP XSCALE.INP.reject_backup
	echo "MAXIMUM_NUMBER_OF_PROCESSORS= 80" >> xscale.inp
	ls */XDS_ASCII.HKL > xscale.inp
	sed -e 's/^/INPUT_FILE= /g' xscale.inp \| sed -e 's/$/\nINCLUDE_RESOLUTION_RANGE= 30 2.5/' > XSCALE.INP.reject_0
	echo "OUTPUT_FILE= scaled_nonmerged.HKL" > XSCALE.INP
	echo "MERGE= FALSE" >> XSCALE.INP
	echo "!REFERENCE_DATA_SET= reference.HKL" >> XSCALE.INP
	echo "" >> XSCALE.INP
	cat XSCALE.INP.reject_0 >> XSCALE.INP

	cp XSCALE.INP XSCALE.INP.reject_0

	for i in `seq 1 1 5`; do
	xscale_par
	cp XSCALE.LP{,_$i}
	./xdscc12 scaled_nonmerged.HKL -dmin 30.0 -dmax 10.0 -nbin 7 > XDSCC.LP
	python xdscc.py XDSCC.LP 1.0 |& tee log.xdscc_"$i"
	cp good.xdscc xscale.inp_"$i"
	sed -e 's/^/INPUT_FILE= /g' xscale.inp_"$i" | sed -e 's/$/\nINCLUDE_RESOLUTION_RANGE= 30 2.5/' > XSCALE.INP.reject_"$i"
	# NOTE: this append has no effect; the `>` redirect on the next line truncates XSCALE.INP.
	echo "MAXIMUM_NUMBER_OF_PROCESSORS= 80" >> XSCALE.INP
	echo "OUTPUT_FILE= scaled_nonmerged.HKL" > XSCALE.INP
	echo "MERGE= FALSE" >> XSCALE.INP
	echo "!REFERENCE_DATA_SET= reference.HKL" >> XSCALE.INP
	echo "" >> XSCALE.INP
	cat XSCALE.INP.reject_"$i" >> XSCALE.INP
	cp scaled_nonmerged.HKL reference.HKL
	done

	for i in `seq 6 1 9`; do
	xscale_par
	cp XSCALE.LP{,_$i}
	./xdscc12 scaled_nonmerged.HKL -dmin 10.0 -dmax 2.5 -nbin 15 > XDSCC.LP
	python xdscc.py XDSCC.LP 1.0 |& tee log.xdscc_"$i"
	cp good.xdscc xscale.inp_"$i"
	sed -e 's/^/INPUT_FILE= /g' xscale.inp_"$i" | sed -e 's/$/\nINCLUDE_RESOLUTION_RANGE= 30 2.5/' > XSCALE.INP.reject_"$i"
	# NOTE: this append has no effect; the `>` redirect on the next line truncates XSCALE.INP.
	echo "MAXIMUM_NUMBER_OF_PROCESSORS= 80" >> XSCALE.INP
	echo "OUTPUT_FILE= scaled_nonmerged.HKL" > XSCALE.INP
	echo "MERGE= FALSE" >> XSCALE.INP
	echo "!REFERENCE_DATA_SET= reference.HKL" >> XSCALE.INP
	echo "" >> XSCALE.INP
	cat XSCALE.INP.reject_"$i" >> XSCALE.INP
	cp scaled_nonmerged.HKL reference.HKL
	done
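
Each rejection round in reject.sh writes an XSCALE.INP with the same layout: a short header, then one INPUT_FILE line followed by one INCLUDE_RESOLUTION_RANGE line per surviving dataset. The sketch below builds that text in Python, as an illustration of the layout rather than a replacement for the script. The function name `build_xscale_inp` and the two example paths are assumptions made for this example.

```python
def build_xscale_inp(input_files, output_file="scaled_nonmerged.HKL",
                     resolution_range=(30, 2.5)):
    """Return the text of an XSCALE.INP that scales all input files unmerged."""
    low, high = resolution_range
    lines = [
        "OUTPUT_FILE= %s" % output_file,
        "MERGE= FALSE",
        "!REFERENCE_DATA_SET= reference.HKL",  # commented out, as in reject.sh
        "",
    ]
    for name in input_files:
        # One INPUT_FILE / INCLUDE_RESOLUTION_RANGE pair per dataset.
        lines.append("INPUT_FILE= %s" % name)
        lines.append("INCLUDE_RESOLUTION_RANGE= %s %s" % (low, high))
    return "\n".join(lines) + "\n"


# Hypothetical file list; in reject.sh it comes from `ls */XDS_ASCII.HKL`
# in the first round and from good.xdscc afterwards.
text = build_xscale_inp(["001_01_01/XDS_ASCII.HKL", "002_01_02/XDS_ASCII.HKL"])
print(text, end="")
```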
	#!/bin/bash

	time=$(date "+%Y_%m_%d_%H_%M_%S")

	PROJECT_NAME="protein"
	NPROC=`nproc`

	# PEAK FINDING PARAMETERS
	SNR='4.4'
	THRESHOLD='20'
	HIGHRES='3.0'

	LST='c1_events.lst'
	CELL='c1_v1.pdb'

	#shuf "$LST" | head -n 1000 > input.lst # your list must have events to enable this
	shuf "$LST" > input.lst # your list must have events to enable this

	GEOM="initial_v1.geom"


	ln -f -s "streams/"$PROJECT_NAME"_${time}.stream" laststream
	indexamajig -i input.lst \
	--temp-dir=scratch \
	-o "streams/"$PROJECT_NAME"_${time}.stream" \
	\
	-g "$GEOM" \
	--peaks=peakfinder8 \
	-j "$NPROC" \
	--min-snr="$SNR" \
	--threshold="$THRESHOLD" \
	--highres="$HIGHRES" \
	--max-res=300 \
	--min-res=80 \
	\
	-p "$CELL" \
	--check-peaks \
	\
	--multi \
	--indexing=dirax,xds,asdf,taketwo,xgandalf |& tee logs/log.indexamajig_${time}
#!/usr/bin/env python

from __future__ import print_function
import os
import sys
import re

# USAGE EXAMPLE:
# xscale_par; grep 'Nano' -A 25 XSCALE.LP; xdscc12 scaled_nonmerged.HKL -dmin 5.0 -dmax 2.5 -nbin 10 > XDSCC.LP; python xdscc.py XDSCC.LP

# The XDSCC.LP file is opened twice: `fin` is used for the per-dataset
# statistics, `fin_re` to look up the name of the reflection file.
fin = open(sys.argv[1])
fin_re = open(sys.argv[1])
try:
    cutoff = float(sys.argv[2])
except IndexError:
    cutoff = 1.0


def fill(string, N=15):
    "Pads string with spaces up to length N (longer strings are truncated)"
    if len(string) > N:
        return string[:N]
    else:
        return string + " " * (N - len(string))


def get_rejected_crystals(xdscclp, rejection_func=None, mode="noano"):
    """
    Parsing of XDSCC12.LP file. The 'rejection_func' argument is currently
    unused: a crystal is rejected when its average CC is negative and its
    average number of reflections per shell exceeds 10.
    Returns set of numbers -- the bad crystals with respect to numbering in
    the initial xdscclp file.
    """

    fin = open(xdscclp)
    bad_crystals = set()
    if mode == "noano":
        a = re.compile(r"^a\s+")
        b = re.compile(r"^b\s+")
        c = re.compile(r"^c\s+")
    elif mode == "ano":
        a = re.compile(r"^d\s+")
        b = re.compile(r"^e\s+")
        c = re.compile(r"^f\s+")
    else:
        print("Wrong mode given to get_rejected_crystals: %s" % mode)
        sys.exit(1)

    while True:
        fline = fin.readline()
        if not fline:
            break

        if a.match(fline):
            crystal_number = int(fline.split()[1])
            if crystal_number % 100 == 0:
                print("Working with crystal number %d" % crystal_number, end="\r")

        elif b.match(fline):
            try:
                CC = [float(i) for i in fline.split()[1:]]
            except ValueError:
                # print("Unusual pattern while parsing CC in %s"%fline)
                # Fields can be fused when a value is -100; separate them
                # before parsing. This is the assumed behavior of the
                # `resolving` helper, which is not defined in this script.
                CC = [float(i) for i in fline.replace("-100", " -100").split()[1:]]
            CC = [i for i in CC if i != 0.0]

        elif c.match(fline):
            Nref = [int(i) for i in fline.split()[1:]]

            CCaverage = sum([i for i in CC]) / len(CC)
            if CCaverage < 0 and sum(Nref) / len(Nref) > 10:
                bad_crystals.add(crystal_number)

    fin.close()
    # returns set()
    return bad_crystals


# expressions to parse HKL file
dataset = lambda string: " ISET=" in string and "INPUT_FILE" in string
reflection_file = lambda string: "reflection file is" in string

getnamesfrom = [i for i in fin_re.readlines() if reflection_file(i)][0].split()[-1]
getnamesfrom = open(getnamesfrom)
fin_re.close()

datasets_from_xscale = dict()
i = 1

for fline in getnamesfrom.readlines():
    if dataset(fline):
        # print(fline,end='')
        datasets_from_xscale[i] = {"name": fline.split("INPUT_FILE=")[-1][:-1]}
        i += 1

getnamesfrom.close()


# expressions to parse XDSCC.LP

resolution_shells = lambda string: "resolution shells (for lines starting" in string
abcdef = re.compile(r"^[abcdef]\s+", re.M)

next_shells = False
for fline in fin.readlines():
    if resolution_shells(fline):
        next_shells = True
        continue
    elif next_shells:
        shells = [float(i) for i in fline.split() if i]
        next_shells = False

    if "overall" in fline:
        j = 0
    elif abcdef.match(fline):
        current_type, numbers = fline.split()[0], [float(i) for i in fline.split()[1:]]
        # Lines of type "a" (or "d") start the record of the next dataset.
        if current_type == "a" or current_type == "d":
            j += 1
        datasets_from_xscale[j][current_type] = numbers

fin.close()


# Optional: pick up Rmeas values from iterxds.log if it exists.
iterxds = True
try:
    fin = open("iterxds.log")
except IOError:
    iterxds = False

if iterxds:
    i = 1
    for fline in fin.readlines():
        if "overall" in fline and len(fline.split()) > 3:
            rmeas_overall_low = fline.replace("%", "").split()[0]
        elif "XDS_ASCII" in fline:
            name = fline.split()[-1]
            rmeas_low, rmeas_overall = fline.replace("%", "").split()[:2]
            rmeas_low = float(rmeas_low)
            rmeas_overall = float(rmeas_overall)
            for key in datasets_from_xscale.keys():
                if datasets_from_xscale[key]["name"] == name:
                    datasets_from_xscale[key]["rmeas_low"] = rmeas_low
                    datasets_from_xscale[key]["rmeas_overall"] = rmeas_overall
    fin.close()


fout = open("good.xdscc", "w")

padding_length = max([len(elem["name"]) for elem in datasets_from_xscale.values()])
toprint = ["--\t%s\t %8.2f" % ("-" * padding_length, 0)]
for key in datasets_from_xscale.keys():
    name = datasets_from_xscale[key]["name"]
    CCnoano = datasets_from_xscale[key]["b"]
    Nrefsnoano = datasets_from_xscale[key]["c"]
    # CCano = datasets_from_xscale[key]['e']
    # Nrefsano = datasets_from_xscale[key]['f']
    if iterxds:
        rmeas_low = datasets_from_xscale[key]["rmeas_low"]
        rmeas_overall = datasets_from_xscale[key]["rmeas_overall"]
        toprint.append(
            "%d\t%s\t %8.2f\t%2.2f\t%2.2f"
            % (
                key,
                fill(name, N=padding_length),
                sum(CCnoano) / len(CCnoano),
                rmeas_low,
                rmeas_overall,
            )
        )
    else:
        toprint.append(
            "%d\t%s\t %8.2f"
            % (key, fill(name, N=padding_length), sum(CCnoano) / len(CCnoano))
        )
    # Datasets whose mean CC reaches the cutoff are kept for the next round.
    if sum(CCnoano) / len(CCnoano) >= cutoff:
        print("%s" % name, file=fout)

fout.close()

# Print the table sorted by mean CC (third column).
print(*sorted(toprint, key=lambda f: float(f.split()[2])), sep="\n")
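
The selection rule that xdscc.py applies when writing good.xdscc is simple: a dataset is kept when the mean of its per-shell CC values (the `b` line of XDSCC.LP) is at least the cutoff. A minimal standalone sketch of that rule follows. The helper name `select_good_datasets` is hypothetical, and the CC numbers are invented purely to illustrate the comparison; they are not results from the deposited data.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))


def select_good_datasets(datasets, cutoff=1.0):
    """Return names of datasets whose mean per-shell CC is at least `cutoff`."""
    return [
        datasets[key]["name"]
        for key in sorted(datasets)
        if mean(datasets[key]["b"]) >= cutoff
    ]


# Invented example values, in the same dict layout as datasets_from_xscale.
datasets = {
    1: {"name": "001_01_01/XDS_ASCII.HKL", "b": [12.0, 8.0, 4.0]},
    2: {"name": "002_01_02/XDS_ASCII.HKL", "b": [-5.0, 1.0, -2.0]},
    3: {"name": "003_02_01/XDS_ASCII.HKL", "b": [1.0, 1.0, 1.0]},
}
good = select_good_datasets(datasets, cutoff=1.0)
print("\n".join(good))
```

With these numbers the second dataset (mean CC of -2.0) is dropped, while the third one, whose mean equals the cutoff exactly, is kept because the comparison is `>=`.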
	https://zenodo.org/record/3921911/files/6RZ4_C1_Pran.tar.gz?download=1
	https://zenodo.org/record/3921911/files/6RZ4_C1_Pran_hkls.tar.gz?download=1