Last active
April 13, 2022 09:23
import bjorn_support as bs
import mutations as bm

# FASTA must include reference NC_045512.2 (e.g. use cat to add the reference)
fasta_filepath = '/valhalla/2021-02-08_release/msa/2021-02-08_release.fa'
# specify name for output alignment
msa_filepath = 'msa.fa'
# run alignment (uses MAFFT but can be changed from bjorn_support.py)
bs.align_fasta(fasta_filepath, msa_filepath)
# load alignment
msa_data = bs.load_fasta(msa_filepath, is_aligned=True)
# identify variants for each sample
# must identify insertions before anything else, otherwise information is lost
try:
    insertions, _ = bm.identify_insertions_per_sample(msa_data)
except Exception:
    # fall back to None if insertion calling fails
    insertions = None
substitutions, _ = bm.identify_replacements_per_sample(msa_data)
deletions, _ = bm.identify_deletions_per_sample(msa_data)
As a rough estimate of runtime, this takes a total of 43.7 seconds on 245 SARS-CoV-2 samples collected between December 2020 and February 2021, using 8 cores on a Linux machine.
Link to supporting code (bjorn): https://github.com/andersen-lab/bjorn
Writing myself a comment to remind myself: in the code snippet above, insertions, substitutions, and deletions are just DataFrames, which can then be exported to a spreadsheet using the DataFrame.to_csv() function.
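To make the reminder above concrete, here is a minimal sketch of the export step. It assumes pandas is installed. The helper export_mutation_tables, the file naming, and the toy columns are all hypothetical stand-ins: the columns of the real tables returned by bjorn may differ. The one detail taken from the script itself is that insertions can be None when that step fails, so the helper skips it.

```python
import os
import tempfile

import pandas as pd


def export_mutation_tables(tables, out_dir, prefix="mutations"):
    """Write every table that is not None to '<out_dir>/<prefix>_<name>.csv'.

    Returns the list of paths that were written. This helper is hypothetical
    and is not part of bjorn.
    """
    written = []
    for name, table in tables.items():
        # The script sets insertions to None when that step raises.
        if table is None:
            continue
        path = os.path.join(out_dir, f"{prefix}_{name}.csv")
        # index=False leaves the row index out of the spreadsheet.
        table.to_csv(path, index=False)
        written.append(path)
    return written


# Made-up placeholder rows; the real bjorn tables may use other columns.
substitutions = pd.DataFrame({"sample": ["s1", "s2"], "mutation": ["m1", "m2"]})
deletions = pd.DataFrame({"sample": ["s1"], "mutation": ["m3"]})
insertions = None

out_dir = tempfile.mkdtemp()
written = export_mutation_tables(
    {"insertions": insertions, "substitutions": substitutions, "deletions": deletions},
    out_dir,
)
print(written)
```

The resulting CSV files open directly in a spreadsheet program.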
Hi Niema, please do let me know if there are any issues or lack of clarity
Thank you, will do! I appreciate it!
import bjorn_support as bs
ModuleNotFoundError: No module named 'bjorn_support'
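One likely cause of the error above (an assumption, since the thread does not confirm it) is that the script is run from outside a local copy of the bjorn repository, so Python cannot find bjorn_support.py or mutations.py. A minimal sketch of a workaround is to put the directory that contains those files on sys.path before the imports. The path "~/bjorn/src" below is hypothetical; point it at wherever bjorn_support.py actually lives in your clone.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def make_importable(module_dir, module_name):
    """Add module_dir to sys.path and report whether module_name can be found."""
    module_dir = str(Path(module_dir).expanduser().resolve())
    if module_dir not in sys.path:
        sys.path.insert(0, module_dir)
    # Make sure files created after interpreter start-up are noticed.
    importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None


# Hypothetical location of a local bjorn clone; adjust as needed.
BJORN_DIR = "~/bjorn/src"
found = make_importable(BJORN_DIR, "bjorn_support")
print("bjorn_support importable:", found)
```

If this prints True, the two import lines at the top of the script should work in the same session. Running the script from inside the directory that holds bjorn_support.py is an alternative that needs no code change.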
Script for tabulating the mutations of each sample in the input FASTA file. Mutations are computed relative to the reference sequence named 'NC_045512.2', which must be present in the input FASTA file.
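Because the reference has to be in the input file, it can help to check for it before running the alignment. The sketch below is a Python equivalent of the cat step mentioned in the script's comment: it looks for a record named NC_045512.2 and prepends the reference only if it is missing. The function names are hypothetical, and it assumes the record name is the first word of the FASTA header, which may not match how bjorn identifies the reference.

```python
import tempfile
from pathlib import Path

REFERENCE_ID = "NC_045512.2"


def fasta_record_ids(fasta_path):
    """Return the ID (first word after '>') of every record in a FASTA file."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                ids.append(header.split()[0] if header else "")
    return ids


def with_reference(samples_path, reference_path, output_path, reference_id=REFERENCE_ID):
    """Copy the samples to output_path, prepending the reference if it is missing."""
    samples_text = Path(samples_path).read_text()
    if reference_id in fasta_record_ids(samples_path):
        combined = samples_text
    else:
        reference_text = Path(reference_path).read_text()
        # Keep the first sample header on its own line.
        if not reference_text.endswith("\n"):
            reference_text += "\n"
        combined = reference_text + samples_text
    Path(output_path).write_text(combined)
    return output_path


# Toy sequences for illustration only, not real genomes.
workdir = Path(tempfile.mkdtemp())
(workdir / "reference.fa").write_text(">NC_045512.2 toy reference\nACGTACGT\n")
(workdir / "samples.fa").write_text(">sample_1\nACGTACGA\n")
combined_path = with_reference(
    workdir / "samples.fa", workdir / "reference.fa", workdir / "combined.fa"
)
print(fasta_record_ids(combined_path))  # ['NC_045512.2', 'sample_1']
```

The combined file would then be the one passed as fasta_filepath. This version reads both files fully into memory, which is fine for a few hundred genomes but may need streaming for much larger inputs.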