54mu/Extended materials and methods.md

    Extended materials and methods

Software list

modeltest-ng
mafft
augur
auspice

Sequence retrieval

The complete collection of SARS-CoV-2 genomes and the associated metadata was downloaded from GISAID on 2020-04-05.

Data preparation

Quality filtering

1. Based on metadata

Sequences lacking a collection date, or flagged as incomplete or low coverage, were removed with the augur filter subcommand.
augur filter --metadata metadata.tsv \
    --sequences lineage_BA_2.parsed.fasta \
    --query "Collection_date != '' and Is_complete == 'True' and Is_low_coverage == ''" \
    --output filtered_sequences.fasta
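
The selection expressed by the query above can also be sketched in plain Python, which may help when checking which records pass before running augur. This is an illustrative sketch, not the augur implementation: it assumes a tab-separated metadata file with the three columns named in the query, treats an empty field as missing, and uses a hypothetical `strain` column for the sequence name.

```python
import csv
import io


def passes_metadata_filter(row):
    """True if the row has a collection date, is complete, and is not low coverage."""
    def value(key):
        # Missing or short rows yield None; treat that the same as an empty field.
        return (row.get(key) or "").strip()

    return (
        value("Collection_date") != ""
        and value("Is_complete") == "True"
        and value("Is_low_coverage") == ""
    )


def filter_metadata(handle):
    """Yield the rows of a TSV metadata file that satisfy the quality criteria."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if passes_metadata_filter(row):
            yield row


# Toy metadata; the "strain" column name is an assumption made for illustration.
example = io.StringIO(
    "strain\tCollection_date\tIs_complete\tIs_low_coverage\n"
    "seq1\t2022-01-10\tTrue\t\n"
    "seq2\t\tTrue\t\n"
    "seq3\t2022-02-01\tTrue\tTrue\n"
)
kept = [row["strain"] for row in filter_metadata(example)]
print(kept)
```

In the toy input, `seq2` has no collection date and `seq3` is flagged as low coverage, so only `seq1` is retained.
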
2. Based on ambiguous nucleotide content

Genomes containing runs of more than 25 consecutive 'N' nucleotides were removed with a Python script.
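
A filter of this kind could look like the sketch below. It is not the script used for the analysis; the function names are hypothetical, and it assumes that "longer than 25" means a run of 26 or more consecutive N characters, counted case-insensitively.

```python
import re

MAX_N_RUN = 25  # threshold stated in the text


def longest_n_run(sequence):
    """Length of the longest run of consecutive N characters (case-insensitive)."""
    runs = re.findall(r"N+", sequence.upper())
    return max((len(run) for run in runs), default=0)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_n_content(records, max_run=MAX_N_RUN):
    """Keep only records whose longest N run does not exceed max_run."""
    for header, sequence in records:
        if longest_n_run(sequence) <= max_run:
            yield header, sequence
```

A typical use would be `filter_by_n_content(read_fasta(open("filtered_sequences.fasta")))`, writing the surviving records to a new FASTA file.
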
Sequence subsampling

The FASTA file produced by the filtering step was split and subsampled according to the following criteria:
BA.2 lineage with insertions

Two FASTA files were generated, one for each of the two insertions of interest: Spike_ins213GRG and Spike_ins213VGGG.
From both, only sequences carrying all other typical BA.2 mutations (inclusion list) and lacking NSP1V88del were kept, with the aim of obtaining the highest-quality data for these subsets.

Inclusion list

Spike_H655Y
N_R203K
Spike_Q954H
Spike_N679K
N_G204R
NSP13_R392C
Spike_P681H
NSP12_P323L
NSP3_G489S
N_P13L
NS3_T223I
NSP4_T492I
M_A63T
NSP4_L438F
E_T9I
Spike_D614G
NSP5_P132H
Spike_N969K
Spike_N764K
Spike_D796Y
N_S413R
NSP1_S135R
NSP4_T327I
NSP4_L264F
Spike_T19I
NSP3_T24I
NSP15_T112I
Spike_S373P
Spike_G339D
Spike_R408S
Spike_S375F
N_R32del
Spike_S371F
Spike_T376A
N_S33del
Spike_D405N
Spike_K417N
N_E31del
Spike_Y505H
Spike_T478K
Spike_Q498R
Spike_S477N
Spike_Q493R
Spike_N501Y
Spike_E484A
NSP6_G107del
NSP6_S106del
NSP6_F108del
Spike_G142D
NSP14_I42V
Spike_P26del
Spike_A27S
Spike_L24del
Spike_P25del
Spike_N440K
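
The inclusion and exclusion rule described above can be sketched as a set comparison. This is a hedged illustration rather than the code used here: it assumes that each genome's mutations are available as a parenthesized, comma-separated string (for example a metadata field), and the function names are hypothetical. The inclusion set in the example is truncated to three entries for brevity.

```python
def parse_mutations(field):
    """Parse a string such as '(Spike_D614G,N_R203K)' into a set of mutation names."""
    cleaned = field.strip().strip("()")
    return {item.strip() for item in cleaned.split(",") if item.strip()}


def keep_sequence(field, insertion, inclusion, excluded):
    """True if the genome has the insertion, every inclusion mutation, and no excluded one."""
    mutations = parse_mutations(field)
    return (
        insertion in mutations
        and inclusion <= mutations      # all required mutations are present
        and not (excluded & mutations)  # none of the excluded mutations is present
    )


# Truncated inclusion set, for illustration only; the full list is given above.
inclusion = {"Spike_D614G", "Spike_N501Y", "N_R203K"}
excluded = {"NSP1V88del"}  # spelled as in the text

good = "(Spike_ins213GRG,Spike_D614G,Spike_N501Y,N_R203K)"
missing_one = "(Spike_ins213GRG,Spike_D614G,N_R203K)"
has_excluded = "(Spike_ins213GRG,Spike_D614G,Spike_N501Y,N_R203K,NSP1V88del)"

print(keep_sequence(good, "Spike_ins213GRG", inclusion, excluded))
print(keep_sequence(missing_one, "Spike_ins213GRG", inclusion, excluded))
print(keep_sequence(has_excluded, "Spike_ins213GRG", inclusion, excluded))
```

Only the first genome satisfies all three conditions; the second lacks Spike_N501Y and the third carries the excluded deletion.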

Other sequences

BA.2
: 600 sequences not containing the aforementioned insertions and pertaining to the BA.2 lineage were randomly sampled.
Sequences from VOCs
: 50 sequences from each VOC (except Omicron) were sampled.
Non-VOC sequences
: 150 sequences from the remaining strains, collected in the year 2020, were sampled.
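
The sampling of the three groups above could be done along the following lines. The text does not state how the random draw was performed, so the fixed seed, the sorting step, and the function names are assumptions made to keep the sketch reproducible.

```python
import random


def subsample(names, k, seed=0):
    """Draw up to k names without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    pool = sorted(names)  # sort so the result does not depend on input order
    return rng.sample(pool, min(k, len(pool)))


def subsample_by_group(groups, k, seed=0):
    """Apply the same sample size to every group, e.g. 50 sequences per VOC."""
    return {group: subsample(names, k, seed) for group, names in sorted(groups.items())}
```

With the sizes given above, this would be called as `subsample(ba2_names, 600)`, `subsample_by_group(voc_names_by_variant, 50)`, and `subsample(non_voc_2020_names, 150)`, where the three inputs are hypothetical collections of sequence names prepared beforehand.
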
Alignment

Because the default alignment step of the augur pipeline removes insertions relative to the reference, we ran mafft manually, with parameters set according to its authors' suggestions for aligning closely related viral genomes. The Wuhan reference genome (NC_045512 on NCBI) was used as the reference sequence.
mafft --6merpair --thread -1 --nomemsave \
    --adjustdirection --addfragments \
    sequences.fasta NC_045512.2.fasta > alignment.fasta
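
Because insertions are retained, they show up in the alignment as columns where the reference row contains a gap. The sketch below shows one way to list them for a single sequence; it is an illustration under that assumption, not part of the original pipeline, and the function name is hypothetical.

```python
def find_insertions(aligned_ref, aligned_query):
    """List insertions in the query relative to the reference.

    Returns (position, bases) tuples, where position is the 1-based reference
    coordinate immediately before the inserted bases (0 if before the first base).
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")

    insertions = []
    ref_pos = 0   # number of reference bases seen so far
    current = []  # inserted bases collected at the current position
    for ref_char, query_char in zip(aligned_ref, aligned_query):
        if ref_char == "-":
            # Reference gap: any query base here is part of an insertion.
            if query_char != "-":
                current.append(query_char)
        else:
            if current:
                insertions.append((ref_pos, "".join(current)))
                current = []
            ref_pos += 1
    if current:
        insertions.append((ref_pos, "".join(current)))
    return insertions


print(find_insertions("ACG---TTA", "ACGGGATTA"))
```

In the toy example, the three bases GGA are reported as an insertion after reference position 3.
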
Tree building and refinement

The substitution model was estimated with modeltest-ng.
The alignment was used as input for the augur tree subcommand, together with a guide tree in Newick format that groups the aligned sequences by the insertion they carry. It can be summarized as:
((BA.2 without insertions), (BA.2 Spike_ins213GRG), (BA.2 Spike_ins213VGGG), wuhan_genome);

This exploits the IQ-TREE constraint-tree option (-g), which restricts the tree search to topologies consistent with the monophyly assumptions encoded in the guide tree.
The augur command is then:
augur tree --alignment alignment.fasta \
    --output tree/raw_tree.nwk --nthreads 24 \
    --tree-builder-args="-g guide_tree.nwk"
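
A guide tree with the shape summarized above can be written from lists of sequence names. The following is a minimal sketch, assuming the names match the alignment headers exactly and contain no Newick special characters (no quoting is done); the function and the example names are hypothetical.

```python
def constraint_newick(groups, outgroup):
    """Build a Newick string in which each group of names forms one clade.

    groups   -- list of lists of sequence names
    outgroup -- name of the reference sequence, attached directly at the root
    """
    clades = []
    for members in groups:
        if not members:
            continue  # skip empty groups
        if len(members) == 1:
            clades.append(members[0])
        else:
            clades.append("(" + ",".join(members) + ")")
    return "(" + ",".join(clades + [outgroup]) + ");"


# Hypothetical names, two per group, for illustration only.
groups = [
    ["ba2_a", "ba2_b"],    # BA.2 without insertions
    ["grg_a", "grg_b"],    # BA.2 Spike_ins213GRG
    ["vggg_a", "vggg_b"],  # BA.2 Spike_ins213VGGG
]
print(constraint_newick(groups, "NC_045512.2"))
```

The printed string can be saved as guide_tree.nwk. Sequences outside the three insertion groups are simply left out of the string in this sketch.
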
The inferred tree was then time-calibrated with the augur refine subcommand:
augur refine --alignment alignment.fasta --tree tree/raw_tree.nwk \
    --metadata filtered_metadata.tsv \
    --output-tree tree/refined_tree.nwk \
    --output-node-data tree/node-data.json \
    --timetree --root "NC_045512.2"
Finally, the visualization file for Auspice was produced with:
augur export v2 --tree tree/refined_tree.nwk \
    --node-data tree/node-data.json \
    --output treeViz_C.json \
    --color-by-metadata has_Spike_ins213 Pango_lineage Variant \
    --metadata filtered_metadata.tsv
Availability of data

Scripts and raw files are available at this repository