Marcus Fedarko fedarko

@fedarko
fedarko / sort-rmdup-bbl.py
Last active August 31, 2022 08:11
Sort and remove duplicate entries from BBL files (the bibliography files that BibTeX generates); useful when combining multiple BBL files (e.g. when using the multibib package) into a single one
#! /usr/bin/env python3
# NOTE: this is a hack, so it will probably break if you have BBL files that
# don't look like the natbib-generated ones I'm used to. It is also pretty
# unintelligent about *how* it sorts entries (it defers most of the work
# to python), so if you have cases where some of your references are by
# the same person or whatever then that might cause the output to not match
# your expectations.
import sys
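The preview above is cut off, so the following is only a minimal sketch of the general idea, not the gist's actual code. It assumes that every entry starts on a line beginning with `\bibitem`, that the input is the text between `\begin{thebibliography}` and `\end{thebibliography}`, and that two entries count as duplicates only when their text is identical. The function name `sort_and_dedupe_bbl_entries` is hypothetical.

```python
def sort_and_dedupe_bbl_entries(bbl_body):
    """Return the unique \\bibitem entries in bbl_body, sorted as plain strings.

    Assumption: each entry begins on a line that starts with "\\bibitem".
    Anything before the first such line is ignored.
    """
    entries = []
    current = []
    for line in bbl_body.splitlines():
        if line.startswith("\\bibitem"):
            # A new entry begins; store the one collected so far.
            if current:
                entries.append("\n".join(current).strip())
            current = [line]
        elif current:
            current.append(line)
    if current:
        entries.append("\n".join(current).strip())
    # Sorting is deferred to Python's default string ordering, so entries are
    # ordered by their raw text rather than by any bibliography-aware rule.
    return sorted(set(entries))
```

Because the sort key is the raw entry text, the resulting order depends on how the `\bibitem` labels happen to be written, which may not match the order a bibliography style would produce.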
@fedarko
fedarko / gfa-to-fasta.py
Created April 15, 2022 02:45
Convert GFA to FASTA
#! /usr/bin/env python3
# Converts a GFA assembly graph to a FASTA file of all sequences
# within the graph. Notably, this ignores connections between sequences
# in the graph.
#
# Depends on Python 3.6 or later.
#
# Usage:
# $ ./gfa-to-fasta.py mygraph.gfa contigs.fasta
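Since the preview stops at the usage line, here is a rough sketch of how such a conversion could work. It is not the gist's code. It assumes GFA 1 input, where each segment is a tab-separated `S` line with the name in the second field and the sequence in the third. Skipping segments whose sequence is `*` (not stored in the file) is a choice made for this sketch. The function name `gfa_to_fasta` is hypothetical.

```python
def gfa_to_fasta(gfa_text):
    """Build FASTA text from the segment (S) lines of GFA 1 text.

    Link lines and all other line types are ignored, so connections
    between sequences are dropped.
    """
    fasta_lines = []
    for line in gfa_text.splitlines():
        if not line.startswith("S\t"):
            continue
        fields = line.split("\t")
        name, sequence = fields[1], fields[2]
        if sequence == "*":
            # Assumption for this sketch: skip segments with no stored sequence.
            continue
        fasta_lines.append(f">{name}")
        fasta_lines.append(sequence)
    return "\n".join(fasta_lines) + ("\n" if fasta_lines else "")
```

A caller would read the GFA file into a string, pass it to the function, and write the returned text to the output FASTA path.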
@fedarko
fedarko / handle_duplicate_sample_ids.py
Last active December 16, 2019 22:12
Script to report on duplicate IDs in a plate map spreadsheet (and modify certain duplicate IDs, in a very specific case); also attempts to update Qiita prep files accordingly. As a warning, code is untested / pretty gross.
#! /usr/bin/env python3
import os
from collections import Counter
from math import ceil
import re
from numpy import argmax
import pandas as pd
from qiime2 import Metadata
# "Parameters" of this script
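The rest of this script is not shown. As a small illustration of the reporting step only, duplicate IDs can be found with the `Counter` class that the script imports. This is a sketch under the assumption that the sample IDs are already available as a list of strings; `find_duplicate_ids` is a hypothetical helper, not a function from the gist.

```python
from collections import Counter


def find_duplicate_ids(sample_ids):
    """Map each sample ID that occurs more than once to its number of occurrences."""
    counts = Counter(sample_ids)
    return {sample_id: n for sample_id, n in counts.items() if n > 1}
```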
@fedarko
fedarko / find_missing_dates.py
Last active December 10, 2019 00:54
In a timeseries metadata file, finds all days that are not "represented" by at least one sample in the metadata
#! /usr/bin/env python3
from dateutil.parser import parse
import pandas as pd
df = pd.read_csv("20191209_metadata.txt", sep="\t", index_col=0)
# Subset to a certain host subject ID, if desired
df = df[df["host_subject_id"] == "M03"]
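The remainder of the script is cut off here. One way to do the date check itself is sketched below with only the standard library, rather than the pandas and dateutil calls the gist uses. It assumes the sample dates have already been parsed into `datetime.date` objects and that the range of interest runs from the earliest to the latest sampled day. `find_missing_dates` is a hypothetical name.

```python
from datetime import timedelta


def find_missing_dates(sample_dates):
    """Return every day between the earliest and latest sample date
    (inclusive) that has no sample, in chronological order."""
    present = set(sample_dates)
    if not present:
        return []
    day, last_day = min(present), max(present)
    missing = []
    while day <= last_day:
        if day not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing
```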
@fedarko
fedarko / negative_control_stats.py
Last active October 27, 2019 23:55
Search taxonomies of negative controls
#! /usr/bin/env python3
"""
This is a small script that looks through the annotated taxonomies of all
features present in a dataset's negative control samples. It's handy for
checking that certain features are (for the most part) absent from these
samples.
This obviously isn't a very formal way of accounting for contamination,
but it is useful for quickly verifying that certain taxa are probably not
the product of contamination. (Better approaches include e.g. the decontam
@fedarko
fedarko / convert_timestamp_to_ordinal_date.py
Created October 8, 2019 22:26
Adds an ordinal date field, based on a timestamp, to a QIIME 2 metadata file
#! /usr/bin/env python3
from qiime2 import Metadata
from dateutil.parser import parse
from dateutil.relativedelta import relativedelta
m = Metadata.load("metadata-with-age.tsv")
m_df = m.to_dataframe()
m_df["ordinal-timestamp"] = 0
@fedarko
fedarko / convert_timestamp_to_days_elapsed.py
Created October 5, 2019 01:02
For a QIIME 2-formatted sample metadata file, uses the collection_timestamp field to assign samples a "days since first day" field. This field is useful for visualizations like q2-longitudinal's volatility plots.
#! /usr/bin/env python3
from qiime2 import Metadata
from dateutil.parser import parse
m = Metadata.load("metadata-with-age-and-ordinal-timestamp.tsv")
m_df = m.to_dataframe()
# Compute earliest date
min_date = None
for sample_id in m_df.index:
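The loop body is not shown in this preview. The "days since first day" computation described above could look roughly like the following sketch, which works on plain `datetime` objects instead of a QIIME 2 `Metadata` object. It assumes each sample's timestamp has already been parsed and that elapsed days are counted between calendar dates, ignoring the time of day. `days_since_first` is a hypothetical name.

```python
def days_since_first(timestamps):
    """Map each sample ID to the number of whole calendar days between its
    timestamp and the earliest timestamp in the input.

    timestamps: dict of sample ID -> datetime.datetime
    """
    if not timestamps:
        return {}
    first_day = min(ts.date() for ts in timestamps.values())
    return {
        sample_id: (ts.date() - first_day).days
        for sample_id, ts in timestamps.items()
    }
```

Comparing dates rather than full timestamps means that two samples taken late one evening and early the next morning are one day apart, even if fewer than 24 hours passed.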
@fedarko
fedarko / gh_url_to_raw_gh_url.py
Created October 2, 2019 22:10
Convert a GitHub file URL to a raw.githubusercontent.com URL (which can be accessed directly by tools like view.qiime2.org or wget)
# your link goes here
link = "https://github.com/knightlab-analyses/qurro-mackerel-analysis/blob/master/AnalysisOutput/qurro-plot.qzv"
# note: this will break if a repo/organization or subfolder is named "blob" -- would be ideal to use a fancy regex
# to be more precise here
print(link.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/"))
# example output link:
# https://raw.githubusercontent.com/knightlab-analyses/qurro-mackerel-analysis/master/AnalysisOutput/qurro-plot.qzv
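The comment in the snippet mentions that a regex would be more precise. A possible version is sketched below. It assumes the link has the form `https://github.com/<owner>/<repo>/blob/<branch>/<path>` and removes only the `blob` segment that directly follows the repository name. `to_raw_url` is a hypothetical helper name.

```python
import re

# owner and repo are single path segments; everything after "/blob/" is kept as-is.
_BLOB_URL = re.compile(r"^https://github\.com/([^/]+)/([^/]+)/blob/(.+)$")


def to_raw_url(link):
    """Convert a github.com file URL to its raw.githubusercontent.com form."""
    match = _BLOB_URL.match(link)
    if match is None:
        raise ValueError(f"Not a recognized GitHub file URL: {link}")
    owner, repo, rest = match.groups()
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{rest}"
```

Anchoring the pattern on the owner and repository segments means that an organization, repository, or subfolder named "blob" is left untouched.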
@fedarko
fedarko / split_metadata_by_run.py
Last active September 26, 2019 22:27
Splits up a QIIME 2 metadata file into separate metadata files, such that there is one file per unique value in a specified "run" column. This is useful if multiple samples from different runs share barcode sequences, which can make QIIME 2 angry.
# NOTE: Assumes that there's a SAMPLE_METADATA environment variable declared pointing to a metadata file
# NOTE: Assumes that this metadata file contains BarcodeSequence and seq_run_ord columns
import pandas as pd
import os
md = pd.read_csv(os.environ["SAMPLE_METADATA"], sep="\t", index_col=0)
print("There are {} unique barcode sequences in this metadata file.".format(len(md["BarcodeSequence"].unique())))
runs = tuple(md["seq_run_ord"].unique())
print("Also, the {} runs listed in this metadata file are {}.".format(len(runs), runs))
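The part of the script that performs the split is not shown. A sketch of that step follows, written with the standard library's `csv` module instead of pandas so that it runs without extra dependencies. It assumes tab-separated metadata text with a header row, and it returns the per-run text rather than writing files. The function name and the `run_column` parameter are hypothetical.

```python
import csv
import io
from collections import defaultdict


def split_metadata_by_run(tsv_text, run_column="seq_run_ord"):
    """Return a dict mapping each value in run_column to TSV text that
    contains the header plus only the rows from that run."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows_by_run = defaultdict(list)
    for row in reader:
        rows_by_run[row[run_column]].append(row)

    outputs = {}
    for run, rows in rows_by_run.items():
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer,
            fieldnames=reader.fieldnames,
            delimiter="\t",
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)
        outputs[run] = buffer.getvalue()
    return outputs
```

Each value of the returned dict could then be written to its own file, for example one named after the run.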
@fedarko
fedarko / add_age_column_to_metadata.py
Last active August 30, 2019 06:25
Adds an "age in years" column to a QIIME 2 sample metadata file
#! /usr/bin/env python3
from qiime2 import Metadata
from dateutil.parser import parse
from dateutil.relativedelta import relativedelta
subject_id = "HOST SUBJECT ID"
subject_birthday = "HOST BIRTHDAY"
subject_birthday_datetime = parse(subject_birthday)
age_col_name = "subject_age_years"
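The preview ends before the age calculation. The gist imports `relativedelta` for this; the sketch below shows a standard-library alternative instead, and it assumes that "age in years" means completed whole years on the sampling date. `age_in_years` is a hypothetical helper.

```python
def age_in_years(birthday, on_date):
    """Return the number of completed years between birthday and on_date.

    Both arguments are datetime.date (or datetime.datetime) objects.
    """
    # Has the birthday already occurred in on_date's calendar year?
    had_birthday = (on_date.month, on_date.day) >= (birthday.month, birthday.day)
    return on_date.year - birthday.year - (0 if had_birthday else 1)
```

Applied to each sample's collection date, this yields the value that would go in the new age column.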